Guidelines for creating scalable clusters

This document provides guidance to help you decide how to create, configure, and operate Google Kubernetes Engine (GKE) clusters that can accommodate workloads approaching the known limits of Kubernetes.

What is scalability?

In a Kubernetes cluster, scalability refers to the ability of the cluster to grow while staying within its service-level objectives (SLOs). Kubernetes also has its own set of SLOs.

Kubernetes is a complex system, and its ability to scale is determined by multiple factors. Some of these factors include the type and number of nodes in a node pool, the types and numbers of node pools, the number of Pods available, how resources are allocated to Pods, and the number of Services or backends behind a Service.

Preparing for availability

Choosing a regional or zonal control plane

Due to architectural differences, regional clusters are better suited for high availability. Regional clusters have multiple control planes across multiple compute zones in a region, while zonal clusters have one control plane in a single compute zone.

If a zonal cluster is upgraded, the control plane VM experiences downtime during which the Kubernetes API is not available until the upgrade is complete.

In regional clusters, the control plane remains available during cluster maintenance like rotating IPs, upgrading control plane VMs, or resizing clusters or node pools. When upgrading a regional cluster, two out of three control plane VMs are always running during the rolling upgrade, so the Kubernetes API is still available. Similarly, a single-zone outage won't cause any downtime in the regional control plane.

However, the higher availability of regional clusters comes with certain trade-offs:

  • Changes to the cluster's configuration take longer because they must propagate across all control planes in a regional cluster instead of the single control plane in zonal clusters.

  • You might not be able to create or upgrade regional clusters as often as zonal clusters. If VMs cannot be created in one of the zones, whether from a lack of capacity or other transient problem, clusters cannot be created or upgraded.

Due to these trade-offs, zonal and regional clusters have different use cases:

  • Use zonal clusters to create or upgrade clusters rapidly when availability is less of a concern.
  • Use regional clusters when availability is more important than flexibility.

Carefully select the cluster type when you create a cluster because you cannot change it after the cluster is created. Instead, you must create a new cluster then migrate traffic to it. Migrating production traffic between clusters is possible but difficult at scale.

Choosing multi-zonal or single-zone node pools

To achieve high availability, the Kubernetes control plane and its nodes need to be spread across different zones. GKE offers two types of node pools: single-zone and multi-zonal.

To deploy a highly available application, distribute your workload across multiple compute zones in a region by using multi-zonal node pools which distribute nodes uniformly across zones.

If all your nodes are in the same zone, you won't be able to schedule Pods if that zone becomes unreachable. Using multi-zonal node pools has certain trade-offs:

  • GPUs are available only in specific zones. It may not be possible to get them in all zones in the region.

  • Round-trip latency between zones within a single region might be higher than that between resources in a single zone. The difference should be immaterial for most workloads.

  • The price of egress traffic between zones in the same region is available on the Compute Engine pricing page.

Preparing to scale

Base infrastructure

Kubernetes workloads require networking, compute, and storage. You need to provide enough CPU and memory to run Pods. However, there are more parameters of underlying infrastructure that can influence performance and scalability of a GKE cluster.

Cluster networking

Using a VPC-native cluster is the networking default and the recommended choice for setting up new GKE clusters. VPC-native clusters allow for larger workloads, a higher number of nodes, and some other advantages.

In this mode, the VPC network has a secondary range for all Pod IP addresses. Each node is then assigned a slice of the secondary range for its own Pod IP addresses. This allows the VPC network to natively understand how to route traffic to Pods without relying on custom routes. A single VPC network can have up to 15,000 VMs. To learn more about the consequences of this limit, see the VMs per VPC network limit section.

Another approach, which is deprecated and supports no more than 1,500 nodes, is to use a routes-based cluster. A routes-based cluster is not a good fit for large workloads. It consumes VPC routes quota and lacks other benefits of the VPC-native networking. It works by adding a new custom route to the routing table in the VPC network for each new node.

Private clusters

In regular GKE clusters, all nodes have public IP addresses. In private clusters, nodes only have internal IP addresses to isolate nodes from inbound and outbound connectivity to the Internet. GKE uses VPC network peering to connect VMs running the Kubernetes API server with the rest of the cluster. This allows higher throughput between GKE control planes and nodes, as traffic is routed using private IP addresses.

Using private clusters has the additional security benefit that nodes are not exposed to the Internet.

Cluster load balancing

GKE Ingress and Cloud Load Balancing configure and deploy load balancers to expose Kubernetes workloads outside the cluster and also to the public internet. The GKE Ingress and Service controllers deploy objects such as forwarding rules, URL maps, backend services, network endpoint groups, and more on behalf of GKE workloads. Each of these resources has inherent quotas and limits and these limits also apply in GKE. When any particular Cloud Load Balancing resource has reached its quota, it will prevent a given Ingress or Service from deploying correctly and errors will appear in the resource's events.

The following table describes the scaling limits when using GKE Ingress and Services:

The following node limits per cluster apply:

  • Internal TCP/UDP load balancer: 1,000 nodes per zone
  • Network load balancer: 1,000 nodes per zone
  • External HTTP(S) load balancer: no node limit
  • Internal HTTP(S) load balancer: no node limit

If you need to scale further, contact your Google Cloud sales team to increase this limit.


DNS

Service discovery in GKE is provided through kube-dns, which is a centralized resource that provides DNS resolution to Pods running inside the cluster. This can become a bottleneck on very large clusters or for workloads that have a high request load. GKE automatically autoscales kube-dns based on the size of the cluster to increase its capacity. When this capacity is still not enough, GKE offers distributed, local resolution of DNS queries on each node with NodeLocal DNSCache. This provides a local DNS cache on each GKE node which answers queries locally, distributing the load and providing faster response times.
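The kube-dns autoscaling mentioned above is performed by a cluster-proportional autoscaler whose linear mode sizes the Deployment from node and core counts. The following sketch assumes the commonly documented linear formula and illustrative parameter values (coresPerReplica=256, nodesPerReplica=16); check your cluster's kube-dns-autoscaler ConfigMap for the actual values GKE uses:

```python
import math

def kube_dns_replicas(cores: int, nodes: int,
                      cores_per_replica: int = 256,
                      nodes_per_replica: int = 16) -> int:
    # cluster-proportional-autoscaler "linear" mode: scale with whichever
    # of total cores or total nodes demands more replicas.
    replicas = max(math.ceil(cores / cores_per_replica),
                   math.ceil(nodes / nodes_per_replica))
    return max(replicas, 1)  # never scale to zero

# A 100-node cluster of 4-vCPU nodes (400 cores total):
print(kube_dns_replicas(cores=400, nodes=100))  # 7 replicas (driven by node count)
```

With these parameters, replica count grows with the node count on small-core clusters and with the core count on clusters of large machines.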

Managing IPs in VPC-native clusters

A VPC-native cluster uses the primary IP range for nodes and two secondary IP ranges for Pods and Services. The maximum number of nodes in VPC-native clusters can be limited by available IP addresses. The number of nodes is determined by both the primary range (node subnet) and the secondary range (Pod subnet). The maximum number of Pods and Services is determined by the size of the cluster's secondary ranges, Pod subnet and Service subnet, respectively.

By default:

  • The Pod secondary range defaults to /14 (262,144 IP addresses).
  • Each node has a /24 range assigned for its Pods (256 IP addresses).
  • The node's subnet is /20 (4092 IP addresses).

However, there must be enough addresses in both ranges (node and Pod) to provision a new node. With the defaults, only 1,024 nodes can be created, because the /14 Pod range contains only 1,024 /24 blocks to assign to nodes.

By default there can be a maximum of 110 Pods per node, and each node in the cluster has an allocated /24 range for its Pods. This results in 256 Pod IPs per node. By having approximately twice as many available IP addresses as possible Pods, Kubernetes is able to mitigate IP address reuse as Pods are added to and removed from a node. However, for certain applications which plan to schedule a smaller number of Pods per node, this is somewhat wasteful. The Flexible Pod CIDR feature allows you to configure the per-node CIDR block size for Pods and use fewer IP addresses.
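The node-count arithmetic above can be checked with Python's ipaddress module; the 10.100.0.0/14 range below is just an example value, not a GKE default address:

```python
import ipaddress

# Example Pod secondary range; GKE's default range size is /14.
pod_range = ipaddress.ip_network("10.100.0.0/14")

def max_nodes(pod_range: ipaddress.IPv4Network, per_node_prefix: int = 24) -> int:
    # Each node claims one /per_node_prefix block out of the Pod range,
    # so the node count is bounded by the number of such blocks.
    return 2 ** (per_node_prefix - pod_range.prefixlen)

print(max_nodes(pod_range))      # 1024 nodes with the default /24 per node
print(max_nodes(pod_range, 26))  # 4096 nodes with a /26 (64 IPs) per node
```

Shrinking the per-node block from /24 to /26 quadruples the number of nodes the same Pod range can serve, at the cost of fewer Pods per node.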

If you need more IP addresses than are available in the private space defined by RFC 1918, then we recommend using non-RFC 1918 addresses.

By default, the secondary range for Services is set to /20 (4,096 IP addresses), limiting the number of Services in the cluster to 4096.

Configuring nodes for better performance

GKE nodes are regular Google Cloud virtual machines. Some of their parameters, for example the number of cores or size of disk, can influence how GKE clusters perform.

Reducing Pod initialization times

You can use Image streaming to stream data from eligible container images as your workloads request them, which leads to faster initialization times.

Egress traffic

In Google Cloud, the machine type and the number of cores allocated to the instance determine its network capacity. Maximum egress bandwidth varies from 1 to 32 Gbps, while the maximum egress bandwidth for default e2-medium machines is 2 Gbps. For details on bandwidth limits, see Shared-core machine types.

IOPS and disk throughput

In Google Cloud, the size of persistent disks determines the IOPS and throughput of the disk. GKE typically uses Persistent Disks as boot disks and to back Kubernetes' Persistent Volumes. Increasing disk size increases both IOPS and throughput, up to certain limits.

Each persistent disk write operation contributes to your virtual machine instance's cumulative network egress cap. Thus the IOPS performance of disks, especially SSDs, also depends on the number of vCPUs in the instance in addition to disk size. Lower-core VMs have lower write IOPS limits due to network egress limitations on write throughput.

If your virtual machine instance has insufficient CPUs, your application won't be able to get close to the IOPS limit. As a general rule, you should have one available CPU for every 2,000-2,500 IOPS of expected traffic.
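The rule of thumb above can be turned into a small sizing helper; the 2,000 IOPS-per-CPU figure used as the default here is the conservative end of the stated range:

```python
import math

def min_vcpus_for_iops(target_iops: int, iops_per_cpu: int = 2000) -> int:
    # One available CPU per 2,000-2,500 IOPS of expected traffic;
    # defaulting to 2,000 gives the safer (larger) CPU count.
    return math.ceil(target_iops / iops_per_cpu)

print(min_vcpus_for_iops(15000))  # 8 vCPUs at the conservative 2,000 IOPS/CPU
```

This is only a lower bound for disk traffic; CPU needed by the workload itself comes on top.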

Workloads that require high capacity or large numbers of disks need to consider the limits of how many PDs can be attached to a single VM. For regular VMs, that limit is 128 disks with a total size of 64 TB, while shared-core VMs have a limit of 16 PDs with a total size of 3 TB. Google Cloud enforces this limit, not Kubernetes.

Limits and quotas relevant for scalability

When designing clusters to grow, consider the general cluster attributes, GKE limits, and complementing limits.

General cluster attributes

  • Total number of objects. There is a limit on the total number of objects in a cluster, both built-in resources and CRDs. The limit depends on their size (how many bytes are stored in etcd), how frequently they change, and on access patterns (for example, frequently reading all objects of a given type exhausts etcd capacity much faster).

  • Total size of objects. The total size of all objects of each resource type shouldn't exceed 800 MiB. This limit applies separately to each resource type. For example, you can create 750 MiB of Pods and 750 MiB of Secrets, but you cannot create 850 MiB of Secrets alone. If you create more than 800 MiB of objects, there is a risk that either Kubernetes controllers or customer-owned controllers will fail to initialize, which might cause disruptions.

GKE limits

To ensure that your architecture supports large scale GKE clusters, we also recommend you consider the following GKE limits. The limits defined in the following table are controlled by GKE quotas.

For limits that aren't listed in the following table, see the default GKE quotas and limits.

Rate of Pod changes

GKE has hard limits on the rate of Pod recreations in a cluster. The kube-controller-manager and kube-scheduler queries per second (QPS) settings limit how quickly Pod changes are processed. Smaller clusters use the following default Kubernetes limits:

  • 20 QPS for the kube-controller-manager
  • 50 QPS for the kube-scheduler

Clusters larger than 500 nodes use the following limits:

  • 100 QPS for the kube-controller-manager
  • 100 QPS for the kube-scheduler

When planning your cluster for running your workloads, consider the following caveats:

  • Deleting a Pod requires two separate calls from the kube-controller-manager.
  • The QPS limit for Pod removal is shared with other resource types, for example EndpointSlices.
  • Pod deletion throughput is therefore limited to 10 or 50 Pods per second, respectively, and might be reduced further by combined factors, for example when a Pod is part of a Service.
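The deletion throughput follows directly from the QPS budget, because each deletion costs two kube-controller-manager calls. A minimal sketch of that arithmetic:

```python
def max_pod_deletions_per_second(kcm_qps: float) -> float:
    # Each Pod deletion takes two kube-controller-manager API calls,
    # so the theoretical ceiling is half the QPS budget. Traffic that
    # shares the same budget (for example, EndpointSlice updates)
    # lowers the real-world number further.
    return kcm_qps / 2

print(max_pod_deletions_per_second(20))   # 10.0 Pods/s (smaller clusters)
print(max_pod_deletions_per_second(100))  # 50.0 Pods/s (clusters > 500 nodes)
```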
Number of Pods per node

GKE has a hard limit of 256 Pods per node. This assumes an average of two or fewer containers per Pod. If you increase the number of containers per Pod, this limit might be lower because GKE allocates more resources per container.

If possible, upgrade your cluster to version 1.23.8-gke.400 or later. GKE versions 1.23.8-gke.400 and later contain memory improvements to GKE components, which let you set the maximum Pods per node above the default limit of 110.

For more information, see manually upgrading a cluster or node pool.

Log insertion requests per minute

The Cloud Logging API default quota is 120,000 log insertion requests per minute. The Cloud Monitoring API default quota is 6,000 time series insertions per minute. You might need to increase these quotas as your cluster grows.

Verify current usage in the Google Cloud console. You can use a smaller cluster to estimate the amount of quota you need.

The log insertion rate depends on the amount of logs generated by the Pods in your cluster. You might need to increase this quota based on the number of Pods and the amount of logs your applications create.

The time series insertion rate is related to your cluster size. Clusters with 100 nodes need approximately 30,000 time series insertions per minute. Clusters with 5,000 nodes need approximately 100,000 per minute.

Compute Engine API quota

GKE control planes use the Compute Engine API to discover metadata of nodes in the cluster and to attach Persistent Disks to VM instances.

By default, the Compute Engine API allows 1,500 requests per minute per project for most read requests, but this limit might not be sufficient for clusters with more than 100 nodes.

If you are planning to create a cluster with more than 100 nodes, increase the `Read requests per minute` quota to at least 3,000. You can change this quota and verify current usage in the Google Cloud console.

Larger clusters need even larger quotas. Clusters with 1,000 nodes need approximately 7,500 read requests per minute per project. Clusters with 5,000 nodes need approximately 15,000 read requests per minute per project.

For more information about Compute Engine API limits, see API rate limits.

The preceding limits assume that you have a basic Kubernetes cluster with no extensions installed. Extending Kubernetes clusters with webhooks or CRDs can constrain your ability to scale the cluster. For more information, read the official list of Kubernetes limits.

Complementing limits

The following limits describe combined scenarios and best practices that might impact cluster performance and reliability. Most of these complementing limits are not enforced, so you can exceed them. Exceeding them won't make the cluster instantly unusable, but its performance degrades (sometimes visible as failing SLOs) before failure.

Open watches

Pods mounting the same Secret or ConfigMap on the same node share a single watch, so the number of watches grows with the number of distinct mounted objects across nodes. However, more than 200,000 open watches might affect cluster initialization times, and this issue can cause repeated control plane restarts.

Use bigger nodes to decrease the likelihood and severity of issues caused by many watches. Higher Pod density (fewer, bigger nodes) might mitigate the severity of some cases.

For more information, see the machine series comparison.

Number of Secrets per cluster

If you store more than 30,000 Secrets in a cluster with application-layer secrets encryption enabled, your cluster might become unstable during upgrades, which can cause workload outages. Use immutable Secrets. For more information, see encrypt secrets at the application layer.
Number of Horizontal Pod Autoscaler objects per cluster

Each Horizontal Pod Autoscaler (HPA) is processed every 15 seconds.

In GKE 1.21 and earlier, more than 100 HPA objects can cause linear degradation of performance. In GKE 1.22 and later, more than 300 HPA objects can cause linear degradation of performance.

Keep the number of HPA objects within these limits, or expect linear degradation of the frequency of HPA processing. For example, in GKE 1.22 with 2,000 HPAs, a single HPA is reprocessed every 1 minute and 40 seconds.

For more information, see autoscaling based on resource utilization and horizontal pod autoscaling scalability.
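The 1 minute 40 seconds figure follows from a simple linear model of the processing budget (300 HPAs per 15-second cycle in GKE 1.22 and later). A sketch, assuming the degradation is exactly linear:

```python
def hpa_interval_seconds(num_hpas: int, limit: int = 300, cycle: float = 15.0) -> float:
    # Below the limit, every HPA is processed each 15-second cycle.
    # Above it, assuming exactly linear degradation, the effective
    # reprocessing interval stretches proportionally.
    return cycle * max(1.0, num_hpas / limit)

print(hpa_interval_seconds(2000))  # 100.0 seconds, i.e. 1 minute 40 seconds
print(hpa_interval_seconds(250))   # 15.0 seconds (still within the limit)
```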

Log bandwidth per node

Cloud Logging collects logs from cluster nodes. These logs are sent to Google Cloud through the Cloud Logging API. Logging performance is limited by the bandwidth of the log streams.

There is no mechanism that detects lost logs when the log bandwidth exceeds the limit.

Periodically check the Cloud Logging API quota to reduce the likelihood that the total bandwidth of node logs exceeds the defined quota.

For more information, see Cloud Logging API in the Google Cloud console.

The etcd database total size

The etcd database size is limited. For very large clusters with tens of thousands of resources, it is possible to come close to that limit. If you cross this limit, etcd instances might be marked as unhealthy, making your cluster's control plane unresponsive.

The limit for the etcd database is 8 GB. Ensure you stay under 6 GB to avoid losing access to the control plane. If you cross this limit, you might need assistance from Google Cloud support.

Total sum of all objects for each resource type in the etcd database

The space for each resource type in the etcd database is limited. For very large clusters with tens of thousands of resources, it is possible to come close to that limit. For example, clusters using many large Secrets or ConfigMaps could experience this issue.

The limit for each resource type in the etcd database is 800 MB. Ensure you stay under 600 MB to avoid issues that would prevent you from creating new resources of that type.
Total number of Services

The performance of iptables degrades if there are too many Services or if there is a high number of backends behind a Service.

Do not create more than 10,000 Services in a cluster.

For more information, see exposing applications using services.

Number of Pods behind a single Service

Every node runs a kube-proxy or GKE Dataplane V2 agent (anetd) that watches for any Service changes. The larger a cluster is, the more change-related data is processed. This is especially visible in clusters with more than 500 nodes.

In GKE version 1.18 and earlier, data was propagated using Endpoints objects, which did not scale well. In GKE version 1.19 and later, information about the endpoints is instead split across separate EndpointSlices, which reduces the amount of data transferred on each change. Endpoints objects are still available for components, but in GKE 1.22 and later, any endpoints above 1,000 Pods are automatically truncated.

In GKE 1.21 and earlier, keep the number of Pods behind a single Service lower than 1,000. In GKE 1.22 and later, keep it lower than 10,000.

The GKE version requirement applies to both the nodes and the control plane.

For more information, see exposing applications using services.
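EndpointSlices scale better partly because Kubernetes packs endpoints into slices of bounded size (100 endpoints per slice by default, configurable via the kube-controller-manager --max-endpoints-per-slice flag), so each Pod change touches only one small object rather than one large Endpoints object. A quick estimate of the slice count for a Service:

```python
import math

def endpoint_slice_count(num_pods: int, max_endpoints_per_slice: int = 100) -> int:
    # Kubernetes fills each EndpointSlice with up to
    # --max-endpoints-per-slice endpoints (default 100).
    return math.ceil(num_pods / max_endpoints_per_slice)

print(endpoint_slice_count(1000))   # 10 slices
print(endpoint_slice_count(10000))  # 100 slices
```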

Number of Services per namespace

Pods receive environment variables for every Service in their namespace. If the number of Service environment variables outgrows shell limits, Pods crash on startup.

Keep the number of Services per namespace below 5,000.

In Kubernetes version 1.13 and later, you can opt out of having those variables populated by setting enableServiceLinks in the PodSpec to false.

For more information, see exposing applications using services.

Clusters larger than 5,000 nodes

GKE supports up to 15,000 nodes in a single cluster. However, all of the preceding limits still apply, in particular the limit on the number of objects resulting from etcd limitations. As a result, not every workload that was deployed on 5,000 nodes will scale up to 15,000 nodes.

In addition, clusters larger than 5,000 nodes must be both regional and private.

Therefore, if you want to create a cluster larger than 5,000 nodes, first create a support ticket requesting a quota increase above 5,000 nodes. The Google Cloud team will then contact you to learn more about your workload and advise whether it will scale as you need it to, or recommend how to modify the workload for better scalability. As part of this process, the team also helps you increase other relevant quotas appropriately.

Disable automounting of the default service account

When you create a Pod, it automatically uses the default service account. You can use the default service account to access the Kubernetes API from inside a Pod using automatically mounted service account credentials.

For every mounted Secret, the kubelet watches the kube-apiserver for changes to that Secret. In large clusters, this translates to thousands of watches and can put a significant load on the kube-apiserver. If a Pod is not going to access the Kubernetes API, disable automounting of the default service account credentials.

What's next?