Scalability

This page describes best practices for creating, configuring, and operating Google Distributed Cloud clusters to accommodate workloads that approach Kubernetes scalability limits.

Scalability limits

Take the limits described in the following sections into account when designing your applications on Google Distributed Cloud.

Understanding limits

Because Google Distributed Cloud is a complex system with a large integration surface, cluster scalability involves many interrelated dimensions. For example, Google Distributed Cloud can scale through the number of nodes, Pods, or Services. Stretching more than one dimension at the same time can cause problems even in smaller clusters. For example, scheduling 110 Pods per node in a 500-node cluster can overstretch the number of Pods, Pods per node, and nodes.

See Kubernetes Scalability thresholds for more details.

Scalability limits are also sensitive to the vSphere configuration and hardware that your cluster runs on. These limits are verified in an environment that is likely different from yours. Therefore, you might not reproduce the exact numbers when the underlying environment is the limiting factor.

Preparing to scale

As you prepare to scale your admin clusters or user clusters, consider the following requirements and limitations.

CPU, memory, and storage requirements

See the CPU, RAM, and storage requirements for each individual VM.

Disk and network I/O requirements

Data-intensive workloads and certain control plane components are sensitive to disk and network I/O latency. For example, etcd typically needs 500 sequential IOPS (such as a typical local SSD or a high-performance virtualized block device) for performance and stability in a cluster with dozens of nodes and thousands of Pods.

Node IP address

Each Google Distributed Cloud node requires one DHCP-assigned or statically assigned IP address.

For example, 308 IP addresses are needed in a setup with one non-HA user cluster with 50 nodes and one HA user cluster with 250 nodes.

The following table shows the breakdown of the IP addresses:

Node type Number of IP addresses
Admin cluster control-plane VM 1
Admin cluster addon node VMs 3
User cluster 1 (non-HA) control-plane VM 1
User cluster 1 node VMs 50
User cluster 2 (HA) control-plane VMs 3
User cluster 2 node VMs 250
Total 308
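The total in the table is the sum of the per-role VM counts. The following minimal sketch shows the arithmetic; the counts are the example values above, not fixed requirements:

```python
# Tally of node IP addresses for the example topology above.
# Substitute your own cluster sizes when planning.
ip_needs = {
    "admin cluster control-plane VM": 1,
    "admin cluster addon node VMs": 3,
    "user cluster 1 (non-HA) control-plane VM": 1,
    "user cluster 1 node VMs": 50,
    "user cluster 2 (HA) control-plane VMs": 3,
    "user cluster 2 node VMs": 250,
}
total_ips = sum(ip_needs.values())
print(total_ips)  # 308
```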

Running many user clusters in an admin cluster

As you prepare to run many user clusters in an admin cluster, perform the following steps when creating the admin cluster.

Pod CIDR block in the admin cluster

The Pod CIDR block is the CIDR block for all Pods in an admin cluster. It is configured via the network.podCIDR field in the admin-cluster.yaml.

From this range, a /24 block is assigned to each node. If you need an N-node admin cluster, you must ensure that this block is large enough to support N /24 blocks.

The following table describes the maximum number of nodes supported by different Pod CIDR block sizes:

Pod CIDR block size Max number of nodes supported
/18 64
/17 128
/16 256
/15 512

The default Pod CIDR block of an admin cluster is 192.168.0.0/16, which supports 256 nodes.
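The sizing rule behind the table can be sketched as follows: because each node receives its own /24, a /P Pod CIDR block holds 2^(24 - P) nodes.

```python
def max_nodes(pod_cidr_prefix: int) -> int:
    """Each node is assigned a /24, so a /P Pod CIDR block supports 2**(24 - P) nodes."""
    return 2 ** (24 - pod_cidr_prefix)

print(max_nodes(16))  # 256, the capacity of the default 192.168.0.0/16
```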

Consider an admin cluster with 50 HA user clusters, where each user cluster has 3 control-plane nodes. The admin cluster then has 1 control-plane node, 2 add-on nodes, and 150 user cluster control-plane nodes, for a total of 153 nodes. Because this is below the 256-node limit, the default Pod CIDR block can support up to 50 HA user clusters.

Service CIDR block in the admin cluster

The Service CIDR block is the CIDR block for all Services in an admin cluster. It is configured via the network.serviceCIDR field in the admin-cluster.yaml.

The following table describes the maximum number of Services supported by different Service CIDR block sizes:

Service CIDR block size Max number of Services supported
/24 256
/23 512
/22 1,024

The default value is 10.96.232.0/24, which supports 256 Services.

Each user cluster uses 5 Services, and the admin cluster control plane uses 13 Services. Running 50 user clusters therefore requires 50 × 5 + 13 = 263 Services, which exceeds the 256 supported by the default /24 range. To run 50 user clusters, you must change the Service CIDR block in the admin cluster to a /23 range.
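The Service CIDR check can be sketched as follows (a /P block provides 2^(32 - P) addresses, matching the table above):

```python
def max_services(service_cidr_prefix: int) -> int:
    """A /P Service CIDR block provides 2**(32 - P) Service IP addresses."""
    return 2 ** (32 - service_cidr_prefix)

# 50 user clusters x 5 Services each, plus 13 Services for the admin control plane:
required = 50 * 5 + 13
print(required)                      # 263
print(required <= max_services(24))  # False: the default /24 (256) is too small
print(required <= max_services(23))  # True: a /23 (512) fits
```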

Cloud Logging and Cloud Monitoring

Cloud Logging and Cloud Monitoring help you track your resources.

The CPU and memory usage of the logging and monitoring components deployed in an admin cluster scale according to the number of user clusters.

The following table describes the amount of admin cluster node CPU and memory required to run a large number of user clusters:

Number of user clusters Admin cluster node CPU Admin cluster node memory
0 to 10 4 CPUs 16GB
11 to 20 4 CPUs 32GB
21 to 50 4 CPUs 90GB

For example, if there are 2 admin cluster nodes and each has 4 CPUs and 16GB memory, you can run 0 to 10 user clusters. To create more than 20 user clusters, you must first resize the admin cluster node memory from 16GB to 90GB.

To change the admin cluster add-on node's memory, edit the MachineDeployment configuration:

  1. Run the following command:

    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG edit machinedeployment gke-admin-node

    where ADMIN_CLUSTER_KUBECONFIG is the path of the kubeconfig file for your admin cluster.

  2. Change the spec.template.spec.providerSpec.value.machineVariables.memory field to 32768.

  3. Save the edit. The admin cluster nodes are recreated with 32GB memory.

GKE Hub

By default, you can register a maximum of 15 user clusters. To register more clusters in GKE Hub, you can submit a request to increase your quota in the Google Cloud console.

Running many nodes and pods in a user cluster

As you prepare to run many nodes and pods in a user cluster, perform the following steps when creating the user cluster.

Pod CIDR block in the user cluster

The Pod CIDR block is the CIDR block for all Pods in a user cluster. It is configured via the network.podCIDR field in the user-cluster.yaml.

From this range, a /24 block is assigned to each node. If you need an N-node cluster, you must ensure that this block is large enough to support N /24 blocks.

The following table describes the maximum number of nodes supported by different Pod CIDR block sizes:

Pod CIDR block size Max number of nodes supported
/18 64
/17 128
/16 256
/15 512

The default Pod CIDR block is 192.168.0.0/16, which supports 256 nodes. For example, to create a cluster with 500 nodes, you must change the Pod CIDR block in the user cluster to use a /15 range.

Service CIDR block in the user cluster

The Service CIDR block is the CIDR block for all Services in a user cluster. It is configured via the network.serviceCIDR field in the user-cluster.yaml.

The following table describes the maximum number of Services supported by different Service CIDR block sizes:

Service CIDR block size Max number of Services supported
/21 2,048
/20 4,096
/19 8,192
/18 16,384

The default value is 10.96.0.0/20, which supports 4,096 services. The default Service CIDR block allows you to create a cluster with 500 Services, which is the maximum number of LoadBalancer type Services Google Distributed Cloud supports in a user cluster.

User cluster control plane nodes

The memory usage of the user cluster control plane components scales according to the number of nodes in the user cluster.

The following table describes the amount of CPU and memory that each user cluster control plane node requires, based on the number of nodes in the user cluster:

Number of nodes User cluster control plane node CPU User cluster control plane node memory
0 to 250 4 CPUs 8GB
251 to 500 4 CPUs 16GB

For example, to create more than 250 nodes in a user cluster, you must use user cluster control plane nodes with at least 16GB of memory.

The user cluster control plane node spec can be changed via the masterNode field in the user-cluster.yaml.
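For example, a user-cluster.yaml fragment sized for a cluster with more than 250 nodes might look like the following sketch (the field values are illustrative; memoryMB is in megabytes):

```yaml
# user-cluster.yaml (fragment): control plane node sizing
masterNode:
  cpus: 4
  memoryMB: 16384  # 16GB, per the table above for 251-500 nodes
  replicas: 3      # 3 replicas for an HA control plane
```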

Dataplane V2

For 500-node user clusters using Dataplane V2, we recommend 120 GB of memory and 32 CPU cores for the user cluster control plane nodes.

Cloud Logging and Cloud Monitoring

Cloud Logging and Cloud Monitoring help you track your resources.

The CPU and memory usage of the in-cluster agents deployed in a user cluster scales with the number of nodes and Pods in the cluster.

Cloud Logging and Cloud Monitoring components such as prometheus-server and stackdriver-prometheus-sidecar have different CPU and memory resource usage based on the number of nodes and the number of Pods. Before you scale up your cluster, set the resource requests and limits according to the estimated average usage of these components. The following table shows estimates of the average usage for each component:

Number of nodes Container name CPU (0 Pods/node) CPU (30 Pods/node) Memory (0 Pods/node) Memory (30 Pods/node)
3 to 50 prometheus-server 100m 390m 650M 1.3G
3 to 50 stackdriver-prometheus-sidecar 100m 340m 1.5G 1.6G
51 to 100 prometheus-server 160m 500m 1.8G 5.5G
51 to 100 stackdriver-prometheus-sidecar 200m 500m 1.9G 5.7G
101 to 250 prometheus-server 400m 2500m 6.5G 16G
101 to 250 stackdriver-prometheus-sidecar 400m 1300m 7.5G 12G
251 to 500 prometheus-server 1200m 2600m 22G 25G
251 to 500 stackdriver-prometheus-sidecar 400m 2250m 65G 78G

Make sure you have nodes big enough to schedule the Cloud Logging and Cloud Monitoring components. One way to do this is to create a small cluster first, edit the Cloud Logging and Cloud Monitoring component resources according to the table above, create a node pool to accommodate the components, and then gradually scale up the cluster to a larger size.

You can choose to maintain a node pool that is just big enough for the monitoring and logging components, and prevent other Pods from being scheduled to it. To do this, add the following taint to the node pool:

taints:
  - effect: NoSchedule
    key: node-role.gke.io/observability

This prevents other components from being scheduled on the node pool and prevents user workloads from being evicted due to the monitoring components' resource consumption.
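For context, such a node pool might be declared in user-cluster.yaml as in the following sketch (the pool name and node sizes are illustrative; choose sizes that fit the resource estimates in the table above):

```yaml
# user-cluster.yaml (fragment): a dedicated node pool for observability components
nodePools:
- name: observability-pool   # illustrative name
  cpus: 4
  memoryMB: 32768
  replicas: 3
  taints:
  - effect: NoSchedule
    key: node-role.gke.io/observability
```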

Load Balancer

The Services discussed in this section refer to the Kubernetes Services with type LoadBalancer.

There is a limit on the number of nodes in your cluster, and on the number of Services that you can configure on your load balancer.

For bundled load balancing (Seesaw), there is also a limit on the number of health checks. The number of health checks depends on the number of nodes and the number of traffic local Services. A traffic local Service is a Service that has its externalTrafficPolicy set to Local.

The following table describes the maximum number of Services, nodes, and health checks for Bundled load balancing (Seesaw) and Integrated load balancing (F5):

Limit Bundled load balancing (Seesaw) Integrated load balancing (F5)
Max Services 500 250 (see note 2)
Max nodes 500 250 (see note 2)
Max health checks N + (L * N) <= 10K, where N is the number of nodes and L is the number of traffic local Services (see note 1) N/A (see note 2)

Note 1: For example, suppose you have 100 nodes and 99 traffic local Services. The number of health checks is 100 + (99 * 100) = 10,000, which is within the 10K limit.

Note 2: Consult F5 for more information. This number is affected by factors such as your F5 hardware model number, virtual instance CPU/memory, and licenses.
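The Seesaw health-check limit can be checked with a short sketch of the formula above:

```python
def seesaw_health_checks(nodes: int, traffic_local_services: int) -> int:
    """N + (L * N) health checks: one per node, plus one per node per traffic local Service."""
    return nodes + traffic_local_services * nodes

print(seesaw_health_checks(100, 99))  # 10000, exactly at the 10K limit
```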

Auto-scaling system components

Google Distributed Cloud automatically scales the system components in user clusters according to the number of nodes without any need for you to change configurations. You can use the information in this section for resource planning.

  • Google Distributed Cloud automatically performs vertical scaling by scaling the CPU and memory requests/limits of the following system components using addon-resizer:

    • kube-state-metrics is a Deployment running on cluster worker nodes that listens to the Kubernetes API server and generates metrics about the state of the objects. The CPU and memory requests and limits scale based on the number of nodes.

      The following table describes the resource requests/limits set by the system, given the number of nodes in a cluster:

      Number of nodes Approximate CPU request/limit (milli) Approximate memory request/limit (Mi)
      3 to 5 105 110
      6 to 500 100 + num_nodes 100 + (2 * num_nodes)

      Note: These values have a margin of +-5% to reduce the number of component restarts during scaling.

      For example, in a cluster with 50 nodes, the CPU request/limit are set to 150m/150m and the memory request/limit are set to 200Mi/200Mi. In a cluster with 250 nodes, the CPU request/limit are set to 350m/350m and the memory request/limit are set to 600Mi/600Mi.

    • metrics-server is a Deployment running on cluster worker nodes that is used by Kubernetes built-in autoscaling pipelines. The CPU and memory requests and limits scale based on the number of nodes.

  • Google Distributed Cloud automatically performs horizontal scaling in both admin clusters and user clusters by scaling the number of replicas of the following system components:

    • kube-dns is the DNS solution used for service discovery in Google Distributed Cloud. It runs as a Deployment on user cluster worker nodes. Google Distributed Cloud automatically scales the number of replicas according to the number of nodes and CPU cores in the cluster: the number of replicas increases or decreases by 1 with each addition or deletion of 16 nodes or 256 cores. For a cluster of N nodes and C cores, expect max(N/16, C/256) replicas. Note that kube-dns supports up to 1,500 concurrent requests per second.

    • calico-typha is a component for supporting Pod networking in Google Distributed Cloud. It runs as a Deployment on user cluster worker nodes. Google Distributed Cloud automatically scales the number of calico-typha replicas based on the number of nodes in the cluster:

      Number of nodes (N) Number of calico-typha replicas
      N = 1 1
      1 < N < 200 2
      N >= 200 3 or more

    • ingress-gateway/istio-pilot are the components for supporting cluster ingress, and they run as a Deployment on user cluster worker nodes. Depending on the amount of traffic ingress-gateway is handling, Google Distributed Cloud uses Horizontal Pod Autoscaler to scale the number of replicas based on their CPU usage, with a minimum of 2 replicas and maximum of 5 replicas.
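The vertical and horizontal scaling rules above can be sketched as follows. The ceiling division in the kube-dns replica count is an assumption; the exact rounding behavior is not specified here.

```python
import math

def kube_state_metrics_resources(num_nodes: int) -> tuple[int, int]:
    """Approximate (CPU milli, memory Mi) request/limit for 6-500 nodes (+-5% in practice)."""
    return 100 + num_nodes, 100 + 2 * num_nodes

def kube_dns_replicas(nodes: int, cores: int) -> int:
    """One replica per 16 nodes or 256 cores, whichever gives more (assumed ceiling division)."""
    return max(math.ceil(nodes / 16), math.ceil(cores / 256), 1)

print(kube_state_metrics_resources(50))  # (150, 200), matching the example above
print(kube_dns_replicas(500, 2000))      # 32: nodes, not cores, dominate here
```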

Best practices

This section describes best practices for scaling your resources.

Scale your cluster in stages

The creation of a Kubernetes node involves cloning the node OS image template into a new disk file, which is an I/O-intensive vSphere operation. There is no I/O isolation between the clone operation and workload I/O operations. If there are too many nodes being created at the same time, the clone operations take a long time to finish and may affect the performance and stability of the cluster and existing workloads.

Ensure that the cluster is scaled in stages depending on your vSphere resources. For example, to resize a cluster from 3 to 500 nodes, consider scaling it in stages: from 3 to 150, then to 350, then to 500. Scaling in stages helps reduce the load on the vSphere infrastructure.

Optimize etcd disk I/O performance

etcd is a key value store used as Kubernetes' backing store for all cluster data. Its performance and stability are critical to a cluster's health and are sensitive to disk and network I/O latency.

  • Optimize the I/O performance of the vSphere datastore that the control-plane VMs use.

  • Latency of a few hundred milliseconds indicates a bottleneck on the disk or network I/O and may result in an unhealthy cluster. Monitor and set alert thresholds for the following etcd I/O latency metrics:

    • etcd_disk_backend_commit_duration_seconds
    • etcd_disk_wal_fsync_duration_seconds
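As an illustration, alerts on these metrics could be expressed as Prometheus alerting rules like the following sketch. The 99th-percentile thresholds (10ms for WAL fsync, 25ms for backend commit) follow common etcd tuning guidance and are assumptions to tune for your environment:

```yaml
# Illustrative Prometheus alerting rules for the etcd latency metrics above.
groups:
- name: etcd-disk-latency
  rules:
  - alert: EtcdSlowWalFsync
    expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
    for: 10m
  - alert: EtcdSlowBackendCommit
    expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.025
    for: 10m
```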

Optimize node boot disk I/O performance

Pods use ephemeral storage for their internal operations, like saving temporary files. Ephemeral storage is consumed by the container's writable layer, logs directory, and emptyDir volumes. Ephemeral storage comes from the Google Distributed Cloud node's file system, which is backed by the boot disk of the node.

Because there is no storage I/O isolation on Kubernetes nodes, applications that consume extremely high I/O on their ephemeral storage can cause node instability by starving system components like the kubelet and the Docker daemon of resources.

Ensure that the I/O performance characteristics of the datastore on which the boot disks are provisioned can provide the right performance for the application's use of ephemeral storage and logging traffic.

Monitor physical resource contention

Be aware of vCPU-to-pCPU ratios and memory overcommitment. A suboptimal ratio or memory contention on the physical hosts can degrade VM performance. Monitor physical resource utilization at the host level and allocate enough resources to run your large clusters.