Guidelines for creating scalable clusters

This document provides guidance to help you decide how to create, configure, and operate Google Kubernetes Engine (GKE) clusters which can accommodate workloads that approach the known limits of Kubernetes.

What is scalability?

In a Kubernetes cluster, scalability refers to the ability of the cluster to grow while staying within its service-level objectives (SLOs). Kubernetes also has its own set of SLOs

Kubernetes is a complex system, and its ability to scale is determined by multiple factors. Some of these factors include the type and number of nodes in a node pool, the types and numbers of node pools, the number of Pods available, how resources are allocated to Pods, and the number of Services or backends behind a Service.

Preparing for availability

Choosing a regional or zonal control plane

Due to architectural differences, regional clusters are better suited for high availability. Regional clusters have multiple control planes across multiple compute zones in a region, while zonal clusters have one control plane in a single compute zone.

If a zonal cluster is upgraded, the control plane VM experiences downtime during which the Kubernetes API is not available until the upgrade is complete.

In regional clusters, the control plane remains available during cluster maintenance like rotating IPs, upgrading control plane VMs, or resizing clusters or node pools. When upgrading a regional cluster, two out of three control plane VMs are always running during the rolling upgrade, so the Kubernetes API is still available. Similarly, a single-zone outage won't cause any downtime in the regional control plane.

However, the more highly available regional clusters come with certain trade-offs:

  • Changes to the cluster's configuration take longer because they must propagate across all control planes in a regional cluster instead of the single control plane in zonal clusters.

  • You might not be able to create or upgrade regional clusters as often as zonal clusters. If VMs cannot be created in one of the zones, whether from a lack of capacity or other transient problem, clusters cannot be created or upgraded.

Due to these trade-offs, zonal and regional clusters have different use cases:

  • Use zonal clusters to create or upgrade clusters rapidly when availability is less of a concern.
  • Use regional clusters when availability is more important than flexibility.

Carefully select the cluster type when you create a cluster because you cannot change it after the cluster is created. Instead, you must create a new cluster then migrate traffic to it. Migrating production traffic between clusters is possible but difficult at scale.

Choosing multi-zonal or single-zone node pools

To achieve high availability, the Kubernetes control plane and its nodes need to be spread across different zones. GKE offers two types of node pools: single-zone and multi-zonal.

To deploy a highly available application, distribute your workload across multiple compute zones in a region by using multi-zonal node pools which distribute nodes uniformly across zones.

If all your nodes are in the same zone, you won't be able to schedule Pods if that zone becomes unreachable. Using multi-zonal node pools has certains trade-offs:

  • GPUs are available only in specific zones. It may not be possible to get them in all zones in the region.

  • Round-trip latency between zones within a single region might be higher than that between resources in a single zone. The difference should be immaterial for most workloads.

  • The price of egress traffic between zones in the same region is available on the Compute Engine pricing page.

Preparing to scale

Base infrastructure

Kubernetes workloads require networking, compute, and storage. You need to provide enough CPU and memory to run Pods. However, there are more parameters of underlying infrastructure that can influence performance and scalability of a GKE cluster.

Cluster networking

Using a VPC-native cluster is the networking default and the recommended choice for setting up new GKE clusters. VPC-native clusters allow for larger workloads, higher number of nodes, and some other advantages.

In this mode, the VPC network has a secondary range for all Pod IP addresses. Each node is then assigned a slice of the secondary range for its own Pod IP addresses. This allows the VPC network to natively understand how to route traffic to Pods without relying on custom routes. A single VPC network can have up to 15,000 VMs. To learn more about the consequences of this limit, see the VMs per VPC network limit section.

Another approach, which is deprecated and supports no more than 1,500 nodes, is to use a routes-based cluster. A routes-based cluster is not a good fit for large workloads. It consumes VPC routes quota and lacks other benefits of the VPC-native networking. It works by adding a new custom route to the routing table in the VPC network for each new node.

Private clusters

In regular GKE clusters, all nodes have public IP addresses. In private clusters, nodes only have internal IP addresses to isolate nodes from inbound and outbound connectivity to the Internet. GKE uses VPC network peering to connect VMs running the Kubernetes API server with the rest of the cluster. This allows higher throughput between GKE control planes and nodes, as traffic is routed using private IP addresses.

Using private clusters has the additional security benefit that nodes are not exposed to the Internet.

Cluster load balancing

GKE Ingress and Cloud Load Balancing configure and deploy load balancers to expose Kubernetes workloads outside the cluster and also to the public internet. The GKE Ingress and Service controllers deploy objects such as forwarding rules, URL maps, backend services, network endpoint groups, and more on behalf of GKE workloads. Each of these resources has inherent quotas and limits and these limits also apply in GKE. When any particular Cloud Load Balancing resource has reached its quota, it will prevent a given Ingress or Service from deploying correctly and errors will appear in the resource's events.

The following table describes the scaling limits when using GKE Ingress and Services:

Load balancer Node limit per cluster
Internal TCP/UDP load balancer
Network load balancer 1000 nodes per zone
External HTTP(S) load balancer
Internal HTTP(S) load balancer No node limit

If you need to scale further, contact your Google Cloud sales team to increase this limit.


Service discovery in GKE is provided through kube-dns which is a centralized resource to provide DNS resolution to Pods running inside the cluster. This can become a bottleneck on very large clusters or for workloads which have high request load. GKE automatically autoscales kube-dns based on the size of the cluster to increase its capacity. When this capacity is still not enough, GKE offers distributed, local resolution of DNS queries on each node with NodeLocal DNSCache. This provides a local DNS cache on each GKE node which answers queries locally, distributing the load and providing faster response times.

Managing IPs in VPC-native clusters

A VPC-native cluster uses the primary IP range for nodes and two secondary IP ranges for Pods and Services. The maximum number of nodes in VPC-native clusters can be limited by available IP addresses. The number of nodes is determined by both the primary range (node subnet) and the secondary range (Pod subnet). The maximum number of Pods and Services is determined by the size of the cluster's secondary ranges, Pod subnet and Service subnet, respectively.

By default:

  • The Pod secondary range defaults to /14 (262,144 IP addresses).
  • Each node has /24 range assigned for its Pods (256 IP addresses for its Pods).
  • The node's subnet is /20 (4092 IP addresses).

However, there must be enough addresses in both ranges (node and Pod) to provision a new node. With defaults, only 1024 can be created due to the number of Pod IPs.

By default there can be a maximum of 110 Pods per node, and each node in the cluster has allocated /24 range for its Pods. This results in 256 Pod IPs per node. By having approximately twice as many available IP addresses as possible Pods, Kubernetes is able to mitigate IP address reuse as Pods are added to and removed from a node. However, for certain applications which plan to schedule a smaller number of Pods per node, this is somewhat wasteful. The Flexible Pod CIDR feature allows per-node CIDR block size for Pods to be configured and use fewer IP addresses.

If you need more IP addresses than are available in the private space of with RFC 1918, then we recommend using Non-RFC 1918 addresses.

By default, the secondary range for Services is set to /20 (4,096 IP addresses), limiting the number of Services in the cluster to 4096.

Configuring nodes for better performance

GKE nodes are regular Google Cloud virtual machines. Some of their parameters, for example the number of cores or size of disk, can influence how GKE clusters perform.

Reducing Pod initialization times

You can use Image streaming to stream data from eligible container images as your workloads request them, which leads to faster initialization times.

Egress traffic

In Google Cloud, the machine type and the number of cores allocated to the instance determine its network capacity. Maximum egress bandwidth varies from 1 to 32 Gbps, while the maximum egress bandwidth for default e2-medium-2 machines is 2 Gbps. For details on bandwidth limits, see Shared-core machine types.

IOPS and disk throughput

In Google Cloud, the size of persistent disks determines the IOPS and throughput of the disk. GKE typically uses Persistent Disks as boot disks and to back Kubernetes' Persistent Volumes. Increasing disk size increases both IOPS and throughput, up to certain limits.

Each persistent disk write operation contributes to your virtual machine instance's cumulative network egress cap. Thus IOPS performance of disks, especially SSDs, also depend on the number of vCPUs in the instance in addition to disk size. Lower core VMs have lower write IOPS limits due to network egress limitations on write throughput.

If your virtual machine instance has insufficient CPUs, your application won't be able to get close to IOPS limit. As a general rule, you should have one available CPU for every 2000-2500 IOPS of expected traffic.

Workloads that require high capacity or large numbers of disks need to consider the limits of how many PDs can be attached to a single VM. For regular VMs, that limit is 128 disks with a total size of 64 TB, while shared-core VMs have a limit of 16 PDs with a total size of 3 TB. Google Cloud enforces this limit, not Kubernetes.

Limits and quotas relevant for scalability

VMs per VPC Network limit

A single VPC network can have up to 15,000 connected VMs. VPC-native cluster can have almost 15,000 nodes - "almost", because a small number of addresses must be reserved for GKE control plane needs.

Note that if you have configured network peering, the limit of 15,000 VMs will by default apply across peered networks.

Compute Engine API quota

GKE control planes use Compute Engine API to discover metadata of nodes in the cluster or to attach persistent disks to VM instances.

By default, Compute Engine API allows 1,500 requests per minute per project for most read requests, but this limit may not be sufficient for clusters with more than 100 nodes.

If you are planning to create a cluster with more than 100 nodes, we recommend increasing the Read requests per minute quota to at least 3,000. You can change this quota and verify current usage in the Google Cloud console.

Larger clusters need even larger quota. Clusters with 1,000 nodes need approximately 7,500 read requests per minute per project. Clusters with 5,000 nodes need approximately 15,000 read requests per minute per project.

For more information about Compute Engine API limits, see API rate limits.

Logging and monitoring quota

The quotas for the Cloud Logging API (default is 120,000 log insertion requests per minute) and the Cloud Monitoring API (default is 6,000 time series insertion rpm) may need to be increased as your cluster grows.

Time series insertion rate is related to your cluster size. Clusters with 100 nodes need around 30,000 time series insertion rpm. Clusters with 5,000 nodes need on the order of 100,000 rpm.

Log insertion rate is dependent on the amount of logs generated by the pods in your cluster. Your quota might need to increase based on the number of pods and applications logging verbosity.

You can change these settings and verify current usage in the Google Cloud console. You could estimate needed quota using smaller cluster first.

Node quota

Number of nodes is limited by default to 5,000 and can be increased on demand to 15,000. For more information, see the Clusters larger than 5000 nodes section.

Understanding limits

Kubernetes, as any other system, has limits which needs to be taken into account while designing applications and planning their growth.

GKE supports up to 15,000 nodes. However, Kubernetes is a complex system with a large feature surface. The number of nodes is only one of many dimensions on which Kubernetes can scale. Other dimensions include the total number of Pods, Services, or backends behind a Service. For considerations relevant to creating large clusters, see the Clusters larger than 5000 nodes section.

You shouldn't stretch more than one dimension at the time. This stress can cause problems even in smaller clusters.

For example, trying to schedule 110 Pods per node in a 5,000-node cluster likely won't succeed because the number of Pods, the number of Pods per node, and the number of nodes would be stretched too far. We recommend up to 30 pods per node on average in large clusters.

Dimension limits

You can read the official list of Kubernetes limits.

Neither this list nor the examples below form an exhaustive list of Kubernetes limits. Those numbers are obtained using a plain Kubernetes cluster with no extensions installed. Extending Kubernetes clusters with webhooks or CRDs is common but can constrain your ability to scale the cluster.

Most of those limits are not enforced, so you can go above them. Exceeding limits won't make the cluster instantly unusable. Performance degrades (sometimes shown by failing SLOs) before failure. Also, some of the limits are given for the largest possible cluster. In smaller clusters, limits are proportionally lower.

  • Number of Pods per node. GKE has a hard limit of 110 Pods per node. This assumes an average of two or fewer containers per Pod. Having too many containers might reduce the limit of 110 because some resources are allocated per container.

  • Total number of Services should stay below 10000. The performance of iptables degrades if there are too many services or if there is a high number of backends behind a Service.

  • Number of Pods behind a single Service is safe if kept below 1,000 since GKE 1.19 and below 250 for GKE 1.18 or earlier. The GKE version requirement applies to both the nodes and the control plane. Every node has a kube-proxy or Dataplane V2 agent (anetd) that watches for any Service changes. So the larger a cluster is, the more data has to be sent. This is especially visible in clusters with more than 500 nodes. You can exceed the limit to some extent, as long as churn of Pods behind a Service is kept at a minimum or the size of the cluster is kept small. In GKE 1.18 and earlier, data is propagated using Endpoint objects, which does not scale well. In GKE 1.19 and later, information about the endpoints is instead split between separate EndpointSlices, reducing the amount of data that needs to be transferred on each change. Endpoints are still available for components, but in GKE 1.22 and later, any endpoints above 1,000 will be automatically truncated.

  • Number of Services per namespace should not exceed 5,000. After this, the number of Service environment variables outgrows shell limits, causing Pods to crash on startup. From Kubernetes 1.13 you can opt-out from having those variables populated by setting enableServiceLinks in PodSpec to false.

  • Total number of objects. There is a limit on all objects in a cluster, both built-in and CRDs. It depends on their size (how many bytes are stored in etcd); how frequently they change; and on access patterns (for example, frequently reading of all objects of a given type would stop etcd much faster).

  • Total size of objects. Total sum of all objects for each resource type shouldn't exceed 800 MiB. This limit is separate for each resource type. For example, you can create 750MiB of pods and 750MiB for secrets, but you cannot create 850MiB of secrets only. If you create more than 800 MiB of objects, there is a risk that either Kubernetes controllers or customer-owned controllers will fail to initialize, which might cause disruptions.

  • Rate of pod changes. There are hard limits on the number of pod additions, updates and deletions in a cluster. They are driven by the kube-controller-manager and the kube-scheduler queries per second (QPS) settings. Smaller clusters use default Kubernetes limits of 20 QPS for the kube-controller-manager and 50 QPS for the kube-scheduler. Clusters larger than 500 nodes use 100 QPS client limit for both components. Keep in mind that deleting a pod requires 2 separate calls from the kube-controller-manager and the QPS limit for pod removal is shared with other resource types (for example, EndpointSlices), so actual deletion throughput is limited to correspondingly 10 or 50 pods per second but depends on many factors and might be reduced (for example, when a pod is a part of a service).

Clusters larger than 5,000 nodes

GKE supports up to 15,000 nodes in a single cluster. It has to be said, though, that all above limits remain in place, in particular the limit on number of objects resulting from etcd limitations. The effect is, not every workload that was deployed on 5,000 nodes, will scale up to 15,000 nodes.

In addition, clusters larger than 5,000 nodes must be both regional and private.

Therefore, if you want to create a cluster larger than 5,000 nodes, you need to create a support ticket first, requesting quota increase above 5,000 nodes. Google team will then contact you to find out more about your workload and advise if it will scale as you need it to - or will recommend, how to modify the workload for better scalability. As part of this activity we will also help you in appropriately increasing other relevant quotas.

Disable automounting default service account

When you create a Pod, it automatically has the default service account. You can use the default service account to access the Kubernetes API from inside a Pod using automatically mounted service account credentials.

For every mounted Secret, the kubelet watches the kube-apiserver for changes to that Secret. In large clusters this translates to thousands of watches and can put a significant load on the kube-apiserver. If the Pod is not going to access Kubernetes API, you should disable the default service account.

What's next?