Plan for large workloads

Autopilot Standard

This page describes the best practices you can follow when managing large workloads on multiple GKE clusters. These best practices cover considerations for distributing workloads across multiple projects and adjusting required quotas.

Best practices for distributing GKE workloads across multiple Google Cloud projects

To define your Google Cloud project structure and the distribution of your GKE workloads based on your business requirements, we recommend the following design and planning actions:

Follow the guidance in Decide a resource hierarchy for your Google Cloud landing zone to make initial decisions about your organization's folder and project structure. Google Cloud recommends using resource hierarchy elements such as folders and projects to divide your workloads along your own organizational boundaries or access policies.
Consider whether you need to split your workloads because of project quotas. Google Cloud uses per-project quotas to restrict usage of shared resources. Follow the recommendations on this page to adjust the project quotas for large workloads. For most workloads, you should be able to obtain the higher quotas that you need in a single project, so quotas should not be the primary driver for splitting your workloads across multiple projects. Keeping your workloads in a smaller number of projects simplifies the administration of your quotas and workloads.
Consider whether you plan to run very large workloads (on the scale of hundreds of thousands of CPUs or more). In that case, splitting your workloads across several projects can increase the availability of cloud resources such as CPUs or GPUs. This is possible because of an optimized configuration of zone virtualization. For such workloads, contact your Account Manager to get special support and recommendations.

Best practices for adjusting quotas for large GKE workloads

This section describes guidelines for adjusting quotas for Google Cloud resources used by GKE workloads. Adjust the quotas for your projects based on the following guidelines. To learn how to manage your quota using the Google Cloud console, see Working with quotas.

Compute Engine quotas and best practices

GKE clusters, in both Autopilot and Standard mode, use Compute Engine resources to run your workloads. In contrast to Kubernetes control plane resources, which are managed internally by Google Cloud, you can manage and evaluate the Compute Engine quotas that your workloads use.

Compute Engine quotas, for both resources and APIs, are shared by all GKE clusters hosted in the same project and region. The same quotas are also shared with other Compute Engine resources that aren't related to GKE, such as standalone VM instances or instance groups.

Default quota values can support several hundred worker nodes and require adjustment for larger workloads. As a platform administrator, you can proactively adjust Compute Engine quotas to ensure that your GKE clusters have enough resources. You should also consider future resource needs when evaluating or adjusting the quota values.

Quotas for Compute Engine resources used by GKE worker nodes

The following table lists resource quotas for the most common Compute Engine resources used by GKE worker nodes. These quotas are configured per project and per region. The quotas must cover the maximum combined size of the GKE worker nodes used by your workload, plus any other Compute Engine resources in the same project and region that aren't related to GKE.

Resource quota	Description
CPUs	Number of CPUs used by all worker nodes of all clusters.
Type of CPUs	Number of each specific type of CPU used by all worker nodes of all clusters.
VM instances	Number of all worker nodes. This quota is automatically calculated as 10x the number of CPUs.
Instances per VPC network	Number of all worker nodes connected to the VPC network.
Persistent Disk standard (GB)	Total size of standard persistent boot disks attached to all worker nodes.
Persistent Disk SSD (GB)	Total size of SSD persistent boot disks attached to all worker nodes.
Local SSD (GB)	Total size of local SSD ephemeral disks attached to all worker nodes.

Make sure to also adjust the quotas for any other resources that your workload might require, such as GPUs, IP addresses, or preemptible resources.
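
As a rough illustration of how the resource quotas in the preceding table add up, the following Python sketch totals the per-region requirements for a planned set of node pools. The NodePool fields and the example node pool shapes are hypothetical and are not part of any GKE or Compute Engine API. The calculation covers only GKE worker nodes, so other Compute Engine usage in the same project and region still has to be added on top.

```python
from dataclasses import dataclass


@dataclass
class NodePool:
    """Hypothetical description of a planned node pool (not a GKE API type)."""
    max_nodes: int           # maximum node count, including autoscaling headroom
    vcpus_per_node: int      # vCPUs of the chosen machine type
    boot_disk_gb: int        # boot disk size per node
    boot_disk_is_ssd: bool   # True for SSD persistent disk, False for standard
    local_ssd_gb: int = 0    # local SSD capacity per node, if any


def required_regional_quotas(pools):
    """Sum the per-region Compute Engine resource needs of all node pools."""
    totals = {
        "cpus": 0,
        "vm_instances": 0,
        "pd_standard_gb": 0,
        "pd_ssd_gb": 0,
        "local_ssd_gb": 0,
    }
    for pool in pools:
        totals["cpus"] += pool.max_nodes * pool.vcpus_per_node
        totals["vm_instances"] += pool.max_nodes
        # Boot disks count against either the standard or the SSD quota.
        disk_key = "pd_ssd_gb" if pool.boot_disk_is_ssd else "pd_standard_gb"
        totals[disk_key] += pool.max_nodes * pool.boot_disk_gb
        totals["local_ssd_gb"] += pool.max_nodes * pool.local_ssd_gb
    return totals


if __name__ == "__main__":
    # Example plan with made-up node pool shapes.
    plan = [
        NodePool(max_nodes=1000, vcpus_per_node=8, boot_disk_gb=100,
                 boot_disk_is_ssd=False),
        NodePool(max_nodes=200, vcpus_per_node=32, boot_disk_gb=200,
                 boot_disk_is_ssd=True, local_ssd_gb=375),
    ]
    for name, value in required_regional_quotas(plan).items():
        print(f"{name}: {value}")
```

Using the maximum node count of each pool, rather than the current count, keeps the totals aligned with the guidance to cover the maximum combined size of your worker nodes.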

Quotas for Compute Engine API calls

Large or scalable clusters require a higher number of Compute Engine API calls. GKE makes these Compute Engine API calls during activities such as:

Checking the state of compute resources.
Adding nodes to or removing nodes from the cluster.
Adding or removing node pools.
Periodic labeling of resources.

When planning your large-size cluster architecture, we recommend you do the following:

Observe historical quota consumption.
Adjust the quotas as needed while keeping a reasonable buffer. You can refer to the following best practice recommendations as a starting point, and adjust the quotas based on your workload needs.
Because quotas are configured per region, adjust quotas only in the regions where you plan to run large workloads.
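
One possible way to turn an observed peak into a quota target with a buffer is sketched below in Python. The 30% default buffer and the rounding step are illustrative assumptions, not values recommended by this page; choose them based on how quickly your workload can grow.

```python
def suggested_quota(peak_usage, buffer_percent=30, round_to=100):
    """Return the observed peak plus a buffer, rounded up to a multiple of round_to.

    peak_usage:     highest observed consumption (for example, requests per minute)
    buffer_percent: assumed headroom on top of the peak (hypothetical default)
    round_to:       granularity of the value that you plan to request
    """
    if peak_usage < 0 or buffer_percent < 0 or round_to <= 0:
        raise ValueError("peak_usage and buffer_percent must be >= 0, round_to > 0")
    # Integer arithmetic avoids floating-point rounding surprises.
    scaled = peak_usage * (100 + buffer_percent)
    needed = (scaled + 99) // 100                      # ceiling of scaled / 100
    return ((needed + round_to - 1) // round_to) * round_to
```

For example, under these assumptions an observed peak of 2,000 requests per minute yields a target of 2,600.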

The following table lists quotas for Compute Engine API calls. These quotas are configured per project and independently for each region. Quotas are shared by all GKE clusters hosted in the same project and in the same region.

API quota	Description	Best practices
Queries per minute per region	These calls are used by GKE to perform various checks against the state of the various compute resources.	For projects and regions with several hundred dynamic nodes, adjust this value to 3,500. For projects and regions with several thousand highly dynamic nodes, adjust this value to 6,000.
Read requests per minute per region	These calls are used by GKE to monitor the state of VM instances (nodes).	For projects and regions with several hundred nodes, adjust this value to 12,000. For projects and regions with thousands of nodes, adjust this value to 20,000.
List requests per minute per region	These calls are used by GKE to monitor the state of instance groups (node pools).	For projects and regions with several hundred dynamic nodes, keep the default value, which is sufficient. For projects and regions with thousands of highly dynamic nodes in multiple node pools, adjust this value to 2,500.
Instance List Referrer requests per minute per region	These calls are used by GKE to obtain information about running VM instances (nodes).	For projects and regions with thousands of highly dynamic nodes, adjust this value to 6,000.
Operation read requests per minute per region	These calls are used by GKE to obtain information about ongoing Compute Engine API operations.	For projects and regions with thousands of highly dynamic nodes, adjust this value to 3,000.
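
The starting points in the preceding table can be captured in a small lookup, sketched here in Python. The quota values are transcribed from the table. The numeric cut-offs for "several hundred" and "thousands" of nodes (200 and 1,000) are assumptions made for illustration only, because the table doesn't define exact boundaries.

```python
# Starting points transcribed from the table. None means that the table keeps
# the default value or gives no recommendation for that tier.
RECOMMENDED_STARTING_QUOTAS = {
    "Queries per minute per region": {"hundreds": 3500, "thousands": 6000},
    "Read requests per minute per region": {"hundreds": 12000, "thousands": 20000},
    "List requests per minute per region": {"hundreds": None, "thousands": 2500},
    "Instance List Referrer requests per minute per region": {
        "hundreds": None, "thousands": 6000},
    "Operation read requests per minute per region": {
        "hundreds": None, "thousands": 3000},
}

# Assumed cut-offs; the table only says "several hundred" and "thousands".
HUNDREDS_THRESHOLD = 200
THOUSANDS_THRESHOLD = 1000


def tier_for(node_count):
    """Map a per-region node count to a tier name, or None for small setups."""
    if node_count >= THOUSANDS_THRESHOLD:
        return "thousands"
    if node_count >= HUNDREDS_THRESHOLD:
        return "hundreds"
    return None


def starting_quotas(node_count):
    """Return the quotas to raise, with their suggested starting values."""
    tier = tier_for(node_count)
    if tier is None:
        return {}
    return {
        name: values[tier]
        for name, values in RECOMMENDED_STARTING_QUOTAS.items()
        if values[tier] is not None
    }
```

Treat the result as a first estimate and refine it against the quota consumption that you actually observe.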

Cloud Logging API and Cloud Monitoring API quotas and best practices

Depending on your cluster configuration, large workloads running on GKE clusters might generate a large volume of diagnostic information. If your workloads exceed Cloud Logging API or Cloud Monitoring API quotas, logging and monitoring data might be lost. We recommend that you configure the verbosity of the logs and adjust Cloud Logging API and Cloud Monitoring API quotas to capture the generated diagnostic information. Managed Service for Prometheus consumes Cloud Monitoring quotas.

Because every workload is different, we recommend you do the following:

Observe historical quota consumption.
Adjust the quotas or adjust logging and monitoring configuration as needed. Keep a reasonable buffer for unexpected issues.

The following table lists quotas for Cloud Logging API and Cloud Monitoring API calls. These quotas are configured per project and are shared by all GKE clusters hosted in the same project.

Service	Quota	Description	Best practices
Cloud Logging API	Write requests per minute	GKE uses this quota when adding entries to log files stored in Cloud Logging.	The log insertion rate depends on the volume of logs generated by the Pods in your cluster. Increase your quota based on the number of Pods, the verbosity of application logging, and your logging configuration. To learn more, see managing GKE logs.
Cloud Monitoring API	Time series ingestion requests per minute	GKE uses this quota when sending Prometheus metrics to Cloud Monitoring: Prometheus metrics consume about 1 call per second for every 200 samples per second you collect. This ingestion volume depends on your GKE workload and configuration; exporting more Prometheus time series will result in more quota consumed.	Monitor and adjust this quota as appropriate. To learn more, see managing GKE metrics.
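
As a worked example of the ratio in the preceding table (about 1 call per second for every 200 Prometheus samples per second), the following Python sketch estimates the time series ingestion requests per minute for a given sample rate. The function name is hypothetical, and because the ratio is approximate, the result is an estimate rather than an exact quota requirement.

```python
import math

# From the text: about 1 API call per second for every 200 samples per second.
SAMPLES_PER_CALL = 200


def ingestion_requests_per_minute(samples_per_second):
    """Estimate Cloud Monitoring ingestion requests per minute."""
    if samples_per_second < 0:
        raise ValueError("samples_per_second must be >= 0")
    # calls/second = samples/second / 200; multiply by 60 for calls/minute.
    return math.ceil(samples_per_second * 60 / SAMPLES_PER_CALL)
```

For instance, collecting 100,000 samples per second corresponds to roughly 500 calls per second, or about 30,000 requests per minute.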

GKE node quota and best practices

GKE supports the following limits:

Up to 15,000 nodes in a single cluster, with the default quota set to 5,000 nodes. Unlike other quotas, this quota is set separately for each GKE cluster and not per project.
In version 1.31 and later, GKE supports large clusters up to 65,000 nodes.

If you plan to scale your cluster above 5,000 nodes, perform the following steps:

Identify the cluster that you want to scale beyond 5,000 nodes.
Make sure your workload stays within cluster limits and GKE quotas after scaling.
Make sure you raise Compute Engine quotas as required for your scaled workload.
Make sure your cluster's availability type is regional and your cluster uses Private Service Connect.
To request an increase of the quota for number of nodes per cluster, contact Cloud Customer Care. The GKE team will contact you to ensure that your workload follows the scalability best practices and is ready to scale beyond 5,000 nodes on a single cluster.
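
The preceding steps can be expressed as a simple pre-flight check, sketched here in Python. The function and its parameters are hypothetical. The node limits come from this section, and the check only lists open items; it doesn't call any Google Cloud API or replace the review by the GKE team.

```python
DEFAULT_NODE_QUOTA = 5_000     # default per-cluster node quota
STANDARD_NODE_LIMIT = 15_000   # supported limit for a single cluster
LARGE_NODE_LIMIT = 65_000      # supported in GKE version 1.31 and later


def open_items_before_scaling(target_nodes, gke_version, is_regional, uses_psc):
    """List what still needs attention before scaling a cluster.

    gke_version is a (major, minor) tuple such as (1, 31).
    Returns an empty list when the target fits the default quota.
    """
    if target_nodes <= DEFAULT_NODE_QUOTA:
        return []

    if tuple(gke_version) >= (1, 31):
        limit = LARGE_NODE_LIMIT
    else:
        limit = STANDARD_NODE_LIMIT

    items = []
    if target_nodes > limit:
        items.append(
            f"Target of {target_nodes} nodes exceeds the {limit}-node limit "
            "for this GKE version."
        )
    if not is_regional:
        items.append("The cluster's availability type must be regional.")
    if not uses_psc:
        items.append("The cluster must use Private Service Connect.")
    # A quota increase is always needed above the default of 5,000 nodes.
    items.append("Contact Cloud Customer Care to raise the per-cluster node quota.")
    return items
```

A check like this is only a reminder of the prerequisites. You still need to verify cluster limits, GKE quotas, and Compute Engine quotas for the scaled workload.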

Best practices for avoiding other limits for large workloads

Limit for number of clusters using VPC Network Peering per network per location

You can create a maximum of 75 clusters that use VPC Network Peering in the same VPC network per location (zones and regions are treated as separate locations). Attempts to create additional clusters above the limit fail with an error similar to the following:

CREATE operation failed. Could not trigger cluster creation:
Your network already has the maximum number of clusters: (75) in location us-central1.

GKE clusters with private nodes created before version 1.29 use VPC Network Peering to provide internal communication between the Kubernetes API server (managed by Google) and private nodes that have only internal IP addresses.

To solve this issue, you can use clusters that use Private Service Connect (PSC) connectivity. Clusters with PSC connectivity provide the same isolation as clusters that use VPC Network Peering, without the 75-cluster limit. Clusters with PSC connectivity don't use VPC Network Peering and aren't affected by the limit on the number of VPC peerings.

To identify whether your clusters use VPC Network Peering, follow the instructions in VPC Network Peering reuse.

To avoid hitting the limit while creating new clusters, perform the following steps:

Ensure that your cluster uses PSC.
Configure the isolation for node pools to become private by using the enable-private-nodes parameter for each node pool.
Optionally, configure the isolation for the control plane by using the enable-private-endpoint parameter at the cluster level. To learn more, see Customize your network isolation.

Alternatively, contact the Google Cloud support team to raise the limit of 75 clusters that use VPC Network Peering. Such requests are evaluated on a case-by-case basis. When increasing the limit is possible, a single-digit increase is applied.

What's next?

See our episodes about building large GKE clusters.