What is scalability?
Kubernetes is a complex system, and its ability to scale is determined by multiple factors. Some of these factors include the type and number of nodes in a node pool, the types and numbers of node pools, the number of Pods available, how resources are allocated to Pods, and the number of Services or backends behind a Service.
Preparing for availability
Choosing a regional or zonal control plane
Due to architectural differences, regional clusters are better suited for high availability. Regional clusters have multiple master nodes across multiple compute zones in a region while zonal clusters have one master node in a single compute zone.
If a zonal cluster is upgraded, the single master VM experiences downtime during which the Kubernetes API is not available until the upgrade is complete.
In regional clusters, the control plane remains available during cluster maintenance like rotating IPs, upgrading master VMs, or resizing clusters or node pools. When upgrading a regional cluster, two out of three master VMs are always running during the rolling upgrade, so the Kubernetes API is still available. Similarly, a single-zone outage won't cause any downtime in the regional control plane.
However, the more highly available regional clusters come with certain trade-offs:
Changes to the cluster's configuration take longer because they must propagate across all masters in a regional cluster instead of the single control plane in zonal clusters.
You might not be able to create or upgrade regional clusters as often as zonal clusters. If VMs cannot be created in one of the zones, whether from a lack of capacity or other transient problem, clusters cannot be created or upgraded.
Due to these trade-offs, zonal and regional clusters have different use cases:
- Use zonal clusters to create or upgrade clusters rapidly when availability is less of a concern.
- Use regional clusters when availability is more important than flexibility.
Carefully select the cluster type when you create a cluster because you cannot change it after the cluster is created. Instead, you must create a new cluster then migrate traffic to it. Migrating production traffic between clusters is possible but difficult at scale.
Choosing multi-zonal or zonal node pools
To achieve high availability, the Kubernetes control plane and its nodes need to be spread across different zones. GKE offers two types of node pools: zonal and multi-zonal.
To deploy a highly available application, distribute your workload across multiple compute zones in a region by using multi-zonal node pools which distribute nodes uniformly across zones.
If all your nodes are in the same zone, you won't be able to schedule Pods if that zone becomes unreachable. Using multi-zonal node pools has certains trade-offs:
GPUs are available only in specific zones. It may not be possible to get them in all zones in the region.
Round-trip latency between locations within a single region is expected to stay below 1ms on the 95th percentile. The difference in traffic latency between zonal and interzonal traffic should be negligible.
The price of egress traffic between zones in the same region is available on the Compute Engine pricing page.
Preparing to scale
Kubernetes workloads require networking, compute, and storage. You need to provide enough CPU and memory to run Pods. However, there are more parameters of underlying infrastructure that can influence performance and scalability of a GKE cluster.
GKE offers two types of cluster networking: older route-based and newer VPC-native.
Routes-based cluster: Each time a node is added, a custom route is added to the routing table in the VPC network.
VPC-native cluster: In this mode, the VPC network has a secondary range for all Pod IP addresses. Each node is then assigned a slice of the secondary range for its own Pod IP addresses. This allows the VPC network to natively understand how to route traffic to Pods without relying on custom routes. A single VPC network can have up to 15k connected VMs, so, effectively, a VPC-native cluster scales up to 5000 nodes.
In regular GKE clusters all nodes have public IP addresses. In Private clusters, nodes only have internal IP addresses to isolate nodes from inbound and outbound connectivity to the Internet. GKE uses VPC network peering to connect VMs running the Kubernetes apiserver with the rest of the cluster. This allows higher throughput between GKE master and nodes, as traffic is not routed via public Internet.
Using private clusters has the additional security benefit that nodes are not exposed to the Internet.
Cluster load balancing
GKE Ingress and load balancer Services in Compute Engine configure and deploy Compute Engine load balancers to expose Kubernetes workloads outside the cluster and also on the public internet. The GKE Ingress and Service controllers deploy objects such as forwarding rules, URL maps, backend services, network endpoint groups, and more on behalf of GKE workloads. Each of these resources has inherent quotas and limits and these limits also apply in GKE. When any particular Compute Engine resource has reached its quota, it will prevent a given Ingress or Service from deploying correctly and errors will appear in the resource's events.
Here are some recommendations and scale limits to keep in mind when using GKE Ingress and Services:
- Ingress (External & Internal) Services that are exposed through GKE: We recommend that you use Container-native load balancing with NEGs with Services exposed through GKE Ingress.
- Ingress resources that don't use NEGs are only supported on clusters up to 1,000 nodes. Ingress with NEGs do not have a cluster node limit.
- Internal Load Balancer Services: Services that deploy internal Compute Engine TCP/UDP load balancers are only supported on clusters up to 250 nodes.
If you need to scale higher contact your Google Cloud sales team to increase this limit. Note that Ingress for Internal HTTP Load Balancing does not have per-cluster node limitations and so it can be used in place of the internal TCP/UDP load balancer for internal HTTP traffic.
Service discovery in GKE is provided through kube-dns which is a centralized resource to provide DNS resolution to Pods running inside the cluster. This can become a bottleneck on very large clusters or for workloads which have high request load. GKE automatically autoscales kube-dns based on the size of the cluster to increase its capacity. When this capacity is still not enough, GKE offers distributed, local resolution of DNS queries on each node with NodeLocal DNSCache. This provides a local DNS cache on each GKE node which answers queries locally, distributing the load and providing faster response times.
Managing IPs in VPC-native clusters
A VPC-native cluster uses the primary IP range for nodes and two secondary IP ranges for Pods and Services. The maximum number of nodes in VPC-native clusters can be limited by available IP addresses. The number of nodes is determined by both the primary range (node subnet) and the secondary range (Pod subnet). The maximum number of Pods and Services is determined by the size of the cluster's secondary ranges, Pod subnet and Service subnet, respectively.
- The Pod secondary range defaults to /14 (262,144 IP addresses).
- Each node has /24 range assigned for its Pods (256 IP addresses for its Pods).
- The node's subnet is /20 (4092 IP addresses).
However, there must be enough addresses in both ranges (node and Pod) to provision a new node. With defaults, only 1024 can be created due to the number of Pod IPs.
By default there can be a maximum of 110 Pods per node, and each node in the cluster has allocated /24 range for its Pods. This results in 256 Pod IPs per node. By having approximately twice as many available IP addresses as possible Pods, Kubernetes is able to mitigate IP address reuse as Pods are added to and removed from a node. However, for certain applications which plan to schedule a smaller number of Pods per node, this is somewhat wasteful. The Flexible Pod CIDR feature allows per-node CIDR block size for Pods to be configured and use fewer IP addresses.
By default, the secondary range for Services is set to /20 (4,096 IP addresses), limiting the number of Services in the cluster to 4096.
Configuring nodes for better performance
GKE nodes are regular Google Cloud virtual machines. Some of their parameters, for example the number of cores or size of disk, can influence how GKE clusters perform.
In Google Cloud, the number of cores allocated to the instance determines its network capacity. One virtual core provides 2Gbps egress bandwidth, and the maximum bandwidth is 16 Gbps or 32 Gbps on Skylake machines (beta feature). All shared-core machine types are limited to 1 Gbps.
IOPS and disk throughput
In Google Cloud, the size of persistent disks determines the IOPS and throughput of the disk. GKE typically uses Persistent Disks as boot disks and to back Kubernetes' Persistent Volumes. Increasing disk size increases both IOPS and throughput, up to certain limits.
Each persistent disk write operation contributes to your virtual machine instance's cumulative network egress cap. Thus IOPS performance of disks, especially SSDs, also depend on the number of vCPUs in the instance in addition to disk size. Lower core VMs have lower write IOPS limits due to network egress limitations on write throughput.
If your virtual machine instance has insufficient CPUs, your application won't be able to get close to IOPS limit. As a general rule, you should have one available CPU for every 2000-2500 IOPS of expected traffic.
Workloads that require high capacity or large numbers of disks need to consider the limits of how many PDs can be attached to a single VM. For regular VMs, that limit is 128 disks with a total size of 64 TB, while shared-core VMs have a limit of 16 PDs with a total size of 3 TB. Google Cloud enforces this limit, not Kubernetes.
Compute Engine API quota
GKE masters use Compute Engine API to discover metadata of nodes in the cluster or to attach persistent disks to VM instances.
By default, Compute Engine API allows 2000 read requests per 100 seconds, which may not be sufficient for clusters with more than 100 nodes.
If you are planning to create a cluster with more than 100 nodes, we recommend
Read requests per 100 seconds and
Read requests per 100 seconds per user
quotas to at least 4000. You can change these settings in the
Google Cloud Console.
Kubernetes, as any other system, has limits which needs to be taken into account while designing applications and planning their growth.
Kubernetes supports up to 5000 nodes in a single cluster. However, Kubernetes is a complex system with a large feature surface. The number of nodes is only one of many dimensions on which Kubernetes can scale. Other dimensions include the total number of Pods, Services, or backends behind a Service.
You shouldn't stretch more than one dimension at the time. This stress can cause problems even in smaller clusters.
For example, trying to schedule 110 Pods per node in a 5K node cluster likely won't succeed because the number of Pods, the number of Pods per node, and the number of nodes would be stretched too far.
You can read the official list of Kubernetes limits.
Neither this list nor the examples below form an exhaustive list of Kubernetes limits. Those numbers are obtained using a plain Kubernetes cluster with no extensions installed. Extending Kubernetes clusters with webhooks or CRDs is common but can constrain your ability to scale the cluster.
Most of those limits are not enforced, so you can go above them. Exceeding limits won't make the cluster instantly unusable. Performance degrades (sometimes shown by failing SLOs) before failure. Also, some of the limits are given for the largest possible cluster. In smaller clusters, limits are proportionally lower.
Number of Pods per node. GKE has a hard limit of 110 Pods per node. This assumes an average of two or fewer containers per Pod. Having too many containers might reduce the limit of 110 because some resources are allocated per container.
Total number of Services should stay below 10000. The performance of iptables degrades if there are too many services or if there is a high number of backends behind a Service.
Number of Pods behind a single Service is safe if kept below 250. Kube-proxy runs on every node and watches all changes to endpoints and Services. So the bigger cluster is, the more data has to be sent. The hard limit is around 5000, depending on the lengths of the Pod names and their namespaces. If they are longer, more bytes are stored in etcd, where the Endpoint object exceeds the maximum size of etcd row. With more backend traffic updates become large, especially in large clusters (500+ nodes). You may go above this number slightly as long as churn of Pods behind the Service is kept at a minimum or the size of the cluster is kept small.
Number of Services per namespace should not exceed 5000. After this, the number of Service environment variables outgrows shell limits, causing Pods to crash on startup. From Kubernetes 1.13 you can opt-out from having those variables populated by setting
enableServiceLinksin PodSpec to false.
Total number of objects. There is a limit on all objects in a cluster, both built-in and CRDs. It depends on their size (how many bytes are stored in etcd); how frequently they change; and on access patterns (for example, frequently reading of all objects of a given type would kill etcd much faster).
Disable automounting default service account
When you create a Pod, it automatically has the default service account. You can use the default service account to access the Kubernetes API from inside a Pod using automatically mounted service account credentials.
For every mounted Secret, the kubelet watches the kube-apiserver for changes to that Secret. In large clusters this translates to thousands of watches and can put a significant load on the kube-apiserver. If the Pod is not going to access Kubernetes API, you should disable the default service account.