This document is for cloud architects and network engineers who are onboarding to Google Kubernetes Engine (GKE).
For a summarized checklist of all the best practices, see Summary of best practices.
Preparation
This document provides a comprehensive list of tasks to complete and decisions to make when you build an enterprise-ready system using GKE. The list provides recommendations and best practices for onboarding, uses infrastructure as code principles, and links to reference documentation.
Configure landing zone properties
Decide on the security for your Google Cloud landing zone.
For more information, see Enterprise foundations blueprint.
Environment
This section lists features, tools, and configurations to consider when setting up the environment that will host your clusters.
Set up Terraform
Use Terraform and store your configuration files in a repository. Terraform is an infrastructure as code tool that lets you build, change, and version resources using a declarative configuration.
To get started, see Getting Started with the Google provider.
You can also use the Terraform examples in the Cloud Foundation Fabric GitHub repo.
For more information, see Best practices for using Terraform.
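As a starting point, a root module might pin the Google provider and set a default project and region. This is a minimal sketch; the project ID, region, and version constraints are placeholder assumptions, not values from this guide.

```hcl
# Hypothetical project ID, region, and version constraints; replace with your own.
terraform {
  required_version = ">= 1.5"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 6.0"
    }
  }
}

provider "google" {
  project = "example-project-id"
  region  = "us-central1"
}
```

Pinning the provider version in the repository keeps plans reproducible across team members and CI runs.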
Store Terraform state
Use Cloud Storage to store your Terraform state. This lets you share information between different Terraform configurations by using remote state to reference other root modules.
For more information, see Store Terraform state in a Cloud Storage bucket.
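The following sketch shows one way to keep state in a Cloud Storage bucket and read the outputs of another root module through remote state. The bucket name and prefixes are hypothetical, and the bucket is assumed to already exist.

```hcl
# Store this root module's state in an existing bucket (hypothetical name).
terraform {
  backend "gcs" {
    bucket = "example-tfstate-bucket"
    prefix = "gke/prod"
  }
}

# Read the outputs of a separate, hypothetical "network" root module
# that stores its state in the same bucket under a different prefix.
data "terraform_remote_state" "network" {
  backend = "gcs"

  config = {
    bucket = "example-tfstate-bucket"
    prefix = "network/prod"
  }
}
```

Outputs from the other module are then available as `data.terraform_remote_state.network.outputs.<name>`.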
Create a metrics scope using Terraform
Create a monitored project and add all Google Cloud projects to it. This lets you view metrics data for groups of projects together in the Monitoring page of the Google Cloud console.
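A possible way to express this in Terraform is the `google_monitoring_monitored_project` resource, which adds a project to the metrics scope of another project. Both project IDs below are hypothetical, and you would repeat the resource (or use `for_each`) for each project that you add.

```hcl
resource "google_monitoring_monitored_project" "workloads" {
  # Project that hosts the metrics scope (hypothetical ID).
  metrics_scope = "example-monitoring-project"

  # Project whose metrics become visible in that scope (hypothetical ID).
  name = "example-workload-project"
}
```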
Set up Helm
Use the Helm package manager to define Kubernetes resources.
Set up Artifact Registry
Use Artifact Registry to store and manage your packages and Docker container images. Configure Artifact Registry to store artifacts from Cloud Build and deploy artifacts to GKE.
Consider using the same location for your Artifact Registry and GKE cluster for the following reasons:
- You aren't billed for network egress from Artifact Registry to GKE.
- You get lower latency when pulling images from Artifact Registry.
- You can use image streaming to pull container images.
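For illustration, a Docker-format repository in the same region as a cluster could be declared as follows. The repository ID and the region are assumptions; use the location of your own cluster.

```hcl
resource "google_artifact_registry_repository" "images" {
  # Assumed to match the region of the GKE cluster that pulls these images.
  location      = "us-central1"
  repository_id = "example-images" # hypothetical name
  description   = "Container images deployed to GKE"
  format        = "DOCKER"
}
```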
Set up Binary Authorization
Use Binary Authorization to provide software supply-chain security.
To get started, see Set up Binary Authorization for GKE.
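On the cluster side, enforcement can be switched on with the `binary_authorization` block of `google_container_cluster`. The sketch below assumes a Standard cluster with placeholder names and that a project-level Binary Authorization policy is managed elsewhere.

```hcl
resource "google_container_cluster" "example" {
  name               = "example-cluster" # hypothetical name
  location           = "us-central1"
  initial_node_count = 1

  # Enforce the project's single Binary Authorization policy on this cluster.
  binary_authorization {
    evaluation_mode = "PROJECT_SINGLETON_POLICY_ENFORCE"
  }
}
```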
Cluster config
When you create a cluster, you should choose cluster configuration options that meet your requirements. Several key features, such as cluster isolation mode, are immutable. This means that you cannot modify the feature without re-creating the cluster. Other features, such as node auto-provisioning, which helps you scale your cluster and sustain workloads, can be modified after cluster creation.
Choose a mode of operation
GKE has two modes of operation: the fully managed Autopilot mode and the more configurable Standard mode. We recommend using Autopilot for most production workloads unless you need granular control over your clusters.
For more information, see Autopilot overview.
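In Terraform, the mode is chosen at creation time with the `enable_autopilot` argument. A minimal sketch, with a hypothetical cluster name and region:

```hcl
resource "google_container_cluster" "autopilot" {
  name     = "example-autopilot-cluster" # hypothetical name
  location = "us-central1"               # Autopilot clusters are regional

  # Omit this argument (or set it to false) to create a Standard cluster.
  enable_autopilot = true
}
```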
Isolate your cluster
Configure your network isolation to minimize control plane exposure.
Configure Backup for GKE
Use Backup for GKE to back up and restore workloads in GKE clusters.
For more information, see Install Backup for GKE.
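One way to install the agent with Terraform is through the cluster's `addons_config` block. This sketch only enables the add-on; backup plans are separate resources and are not shown. Names are placeholders.

```hcl
resource "google_container_cluster" "example" {
  name               = "example-cluster" # hypothetical name
  location           = "us-central1"
  initial_node_count = 1

  addons_config {
    # Installs the Backup for GKE agent in the cluster.
    gke_backup_agent_config {
      enabled = true
    }
  }
}
```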
Use Container-Optimized OS node images
Use Container-Optimized OS node images, which are optimized to enhance node security.
For more information, see Available node images.
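For Standard clusters, the node image is set per node pool through `image_type`. The sketch below assumes an existing cluster named `example-cluster`; all names are hypothetical.

```hcl
resource "google_container_node_pool" "cos_pool" {
  name       = "cos-pool"        # hypothetical name
  cluster    = "example-cluster" # assumed to exist
  location   = "us-central1"
  node_count = 1

  node_config {
    # Container-Optimized OS with the containerd runtime.
    image_type = "COS_CONTAINERD"
  }
}
```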
Enable node auto-provisioning
Enable node auto-provisioning to automatically manage and scale node pools.
Node auto-provisioning is enabled by default in Autopilot clusters.
For more information, see Enabling node auto-provisioning.
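On a Standard cluster, node auto-provisioning corresponds to the `cluster_autoscaling` block. The resource limits below are arbitrary example values that you would size for your own workloads.

```hcl
resource "google_container_cluster" "example" {
  name               = "example-cluster" # hypothetical name
  location           = "us-central1"
  initial_node_count = 1

  cluster_autoscaling {
    enabled = true

    # Cluster-wide bounds that auto-provisioned node pools must stay within.
    resource_limits {
      resource_type = "cpu"
      minimum       = 4
      maximum       = 64
    }

    resource_limits {
      resource_type = "memory" # in GB
      minimum       = 16
      maximum       = 256
    }
  }
}
```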
Separate your kube-system Pods
Separate your kube-system Pods from your workloads to prevent scale-down issues on underutilized nodes that run kube-system Pods.
For more information, see Separating kube-system pods from your workloads.
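A dedicated node pool for system Pods might look like the following sketch. It assumes the `components.gke.io/gke-managed-components` label and taint that GKE-managed components tolerate; confirm the exact key in the linked guide before relying on it. The cluster and pool names are hypothetical.

```hcl
resource "google_container_node_pool" "system_pool" {
  name       = "system-pool"     # hypothetical name
  cluster    = "example-cluster" # assumed to exist
  location   = "us-central1"
  node_count = 1

  node_config {
    # Assumed label and taint key; verify against the GKE documentation.
    labels = {
      "components.gke.io/gke-managed-components" = "true"
    }

    # Keeps Pods without a matching toleration off these nodes.
    taint {
      key    = "components.gke.io/gke-managed-components"
      value  = "true"
      effect = "NO_SCHEDULE"
    }
  }
}
```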
Security
How you configure your cluster security features impacts your ability to keep your cluster secure and reduce the potential attack surface. For example, the security posture dashboard provides continuous scanning for vulnerabilities and common misconfigurations in your workloads.
In addition to the steps in this section, review the GKE hardening guide and apply any additional security controls that are relevant to your organization's security needs.
Use the security posture dashboard
Use the security posture dashboard to automate detection and reporting of common security concerns across multiple clusters and workloads, with minimal intrusion and disruption to your running applications.
Security posture is enabled by default in Autopilot clusters in GKE versions 1.27 and later. For more information, see About the security posture dashboard.
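For Standard clusters, the dashboard features can be turned on with the `security_posture_config` block. This is a sketch with placeholder names; the basic tiers shown here are one possible choice.

```hcl
resource "google_container_cluster" "example" {
  name               = "example-cluster" # hypothetical name
  location           = "us-central1"
  initial_node_count = 1

  security_posture_config {
    mode               = "BASIC"               # workload configuration auditing
    vulnerability_mode = "VULNERABILITY_BASIC" # container OS vulnerability scanning
  }
}
```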
Use group authentication
Use groups to manage your users.
For more information, see Google Groups for RBAC.
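With Terraform, Google Groups for RBAC is configured through `authenticator_groups_config`. The group address below is a placeholder; the sketch assumes that a `gke-security-groups` group exists in your domain and contains the groups you reference in RBAC.

```hcl
resource "google_container_cluster" "example" {
  name               = "example-cluster" # hypothetical name
  location           = "us-central1"
  initial_node_count = 1

  authenticator_groups_config {
    # Hypothetical domain; the parent group nests the groups used in RBAC.
    security_group = "gke-security-groups@example.com"
  }
}
```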
Use RBAC to restrict access to cluster resources
Assign the appropriate IAM roles for GKE to groups and users to provide permissions at the project level and use RBAC to grant permissions on a cluster and namespace level.
For more information, see Use namespaces and RBAC to restrict access to cluster resources.
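The two layers can be sketched together: an IAM role at the project level and an RBAC binding at the namespace level. The group, project, and namespace names are hypothetical, and the sketch assumes that the Terraform `kubernetes` provider is already configured against the cluster.

```hcl
# Project level: lets the group discover clusters and fetch credentials.
resource "google_project_iam_member" "dev_cluster_viewer" {
  project = "example-project-id"
  role    = "roles/container.clusterViewer"
  member  = "group:dev-team@example.com"
}

# Namespace level: grants the built-in "edit" ClusterRole in one namespace only.
resource "kubernetes_role_binding" "dev_edit" {
  metadata {
    name      = "dev-team-edit"
    namespace = "team-a" # assumed to exist
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "ClusterRole"
    name      = "edit"
  }

  subject {
    kind      = "Group"
    name      = "dev-team@example.com"
    api_group = "rbac.authorization.k8s.io"
  }
}
```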
Enable Shielded GKE Nodes
Enable Shielded GKE Nodes to increase the security of GKE nodes.
Shielded GKE Nodes are enabled by default in Autopilot clusters.
For more information, see Enable Shielded GKE Nodes.
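On a Standard cluster, this might be expressed as follows. The names are placeholders, and enabling secure boot is an optional extra shown here as an assumption about what you want.

```hcl
resource "google_container_cluster" "example" {
  name               = "example-cluster" # hypothetical name
  location           = "us-central1"
  initial_node_count = 1

  enable_shielded_nodes = true

  node_config {
    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
  }
}
```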
Enable Workload Identity Federation for GKE
Authenticate to Google APIs using Workload Identity Federation for GKE.
Workload Identity Federation for GKE is enabled by default in Autopilot clusters. In Autopilot clusters, you can configure your applications to use Workload Identity Federation for GKE without needing to enable it first.
For more information, see Enable Workload Identity Federation for GKE.
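For a Standard cluster, a sketch of the setup includes the workload pool on the cluster and an IAM binding that lets a Kubernetes service account impersonate a Google service account. The project ID, namespace, and account names are all hypothetical, and the Google service account is assumed to exist.

```hcl
resource "google_container_cluster" "example" {
  name               = "example-cluster"
  location           = "us-central1"
  initial_node_count = 1

  workload_identity_config {
    workload_pool = "example-project-id.svc.id.goog"
  }
}

# Allow the Kubernetes service account team-a/app-ksa to act as app-sa.
resource "google_service_account_iam_member" "ksa_binding" {
  service_account_id = "projects/example-project-id/serviceAccounts/app-sa@example-project-id.iam.gserviceaccount.com"
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:example-project-id.svc.id.goog[team-a/app-ksa]"
}
```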
Enable security bulletin notifications
Enable security bulletin notifications to receive notifications when security bulletins are available that are relevant to your cluster.
For more information, see Enable security bulletin notifications.
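One way to wire this up is a Pub/Sub topic plus the cluster's `notification_config` block, filtered to security bulletin events. The topic and cluster names are assumptions.

```hcl
resource "google_pubsub_topic" "gke_security" {
  name = "gke-security-bulletins" # hypothetical name
}

resource "google_container_cluster" "example" {
  name               = "example-cluster"
  location           = "us-central1"
  initial_node_count = 1

  notification_config {
    pubsub {
      enabled = true
      topic   = google_pubsub_topic.gke_security.id

      # Only deliver security bulletin notifications to this topic.
      filter {
        event_type = ["SECURITY_BULLETIN_EVENT"]
      }
    }
  }
}
```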
Use least privilege Google service accounts
Create a minimally privileged service account for your nodes to use instead of the Compute Engine default service account.
For more information, see Use least privilege Google service accounts.
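A sketch of a dedicated node service account follows. It assumes the `roles/container.defaultNodeServiceAccount` role is sufficient for your nodes; the account, project, and cluster names are hypothetical.

```hcl
resource "google_service_account" "gke_nodes" {
  account_id   = "gke-nodes"
  display_name = "GKE node service account"
}

# Minimal role that GKE nodes need for logging and monitoring.
resource "google_project_iam_member" "gke_nodes_default" {
  project = "example-project-id"
  role    = "roles/container.defaultNodeServiceAccount"
  member  = "serviceAccount:${google_service_account.gke_nodes.email}"
}

resource "google_container_node_pool" "pool" {
  name       = "default-pool"
  cluster    = "example-cluster" # assumed to exist
  location   = "us-central1"
  node_count = 1

  node_config {
    service_account = google_service_account.gke_nodes.email
    oauth_scopes    = ["https://www.googleapis.com/auth/cloud-platform"]
  }
}
```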
Restrict network access to the control plane and nodes
Limit the exposure of your cluster control plane and nodes to the internet, for example by using private nodes.
For more information, see Restrict network access to the control plane and nodes.
Restrict access to cluster API discovery
Configure authorized networks to restrict access to your cluster's discovery APIs.
For more information, see Restrict access to cluster API discovery.
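The two preceding recommendations, private nodes and authorized networks, can be sketched in a single cluster definition. The CIDR ranges and names are placeholders chosen for illustration.

```hcl
resource "google_container_cluster" "example" {
  name               = "example-cluster" # hypothetical name
  location           = "us-central1"
  initial_node_count = 1

  # VPC-native networking; an empty block lets GKE choose secondary ranges.
  ip_allocation_policy {}

  private_cluster_config {
    enable_private_nodes    = true  # nodes get internal IP addresses only
    enable_private_endpoint = false # keep a public endpoint, restricted below
    master_ipv4_cidr_block  = "172.16.0.0/28"
  }

  # Only these ranges can reach the control plane endpoint.
  master_authorized_networks_config {
    cidr_blocks {
      cidr_block   = "203.0.113.0/24" # documentation range; use your own
      display_name = "corp-network"
    }
  }
}
```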
Use namespaces to restrict access to cluster resources
Give teams least-privilege access to Kubernetes by creating separate namespaces or clusters for each team and environment.
For more information, see Use namespaces and RBAC to restrict access to cluster resources.
Networking
How you configure your networking environment in Google Cloud impacts your ability to scale, right-size, and troubleshoot your GKE clusters. For example, you must consider how many unique IP addresses you need before you create a cluster. Otherwise, you might run out of IP addresses when your cluster expands.
Create a custom mode VPC
Create a custom mode Virtual Private Cloud (VPC) to choose IP address ranges that won't overlap with other IP address ranges in your environment.
For more information, see Use custom subnet mode.
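A custom mode VPC with a subnet sized for GKE could be declared as in this sketch. All names and ranges are assumptions; size the secondary ranges for the number of Pods and Services you expect.

```hcl
resource "google_compute_network" "vpc" {
  name                    = "example-vpc"
  auto_create_subnetworks = false # custom mode: you define every subnet
}

resource "google_compute_subnetwork" "gke" {
  name          = "gke-subnet"
  region        = "us-central1"
  network       = google_compute_network.vpc.id
  ip_cidr_range = "10.0.0.0/20" # node addresses

  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.4.0.0/14"
  }

  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.8.0.0/20"
  }
}
```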
Create a proxy-only subnet
Create a proxy-only subnet to expose Services with internal Application Load Balancers.
For more information, see Create a load balancer subnet.
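A proxy-only subnet is an ordinary subnetwork resource with a special purpose. In this sketch, the network name, region, and range are placeholders, and the network is assumed to exist.

```hcl
resource "google_compute_subnetwork" "proxy_only" {
  name          = "proxy-only-subnet"
  region        = "us-central1"
  network       = "example-vpc" # assumed to exist
  ip_cidr_range = "10.129.0.0/23"

  # Reserved for the Envoy proxies of regional Application Load Balancers.
  purpose = "REGIONAL_MANAGED_PROXY"
  role    = "ACTIVE"
}
```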
Configure Shared VPC
Use Shared VPC to delegate responsibilities, such as creating and managing instances, to Service Project Admins while maintaining centralized control over network resources such as subnets, routes, and firewalls.
For more information, see Use Shared VPC networks.
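The host and service project relationship might be declared as follows. Both project IDs are hypothetical; subnet-level IAM grants for the service project are not shown.

```hcl
# Project that owns the shared network.
resource "google_compute_shared_vpc_host_project" "host" {
  project = "example-host-project"
}

# Project that runs GKE clusters on the shared network.
resource "google_compute_shared_vpc_service_project" "gke" {
  host_project    = google_compute_shared_vpc_host_project.host.project
  service_project = "example-service-project"
}
```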
Connect the cluster's VPC network to an on-premises network
Use Cloud VPN tunnels or Cloud Interconnect VLAN attachments to connect your cluster's VPC network to an on-premises network or other network from which you plan to administer your cluster. This lets you administer the cluster using tools like kubectl by connecting to the control plane private endpoint.
Make sure that the subnet IP address ranges in your cluster's VPC network don't conflict with the IP address ranges used in the network to which you connect.
Enable Cloud NAT
Enable Cloud NAT so that private nodes can connect to internet IP address destinations and external IP addresses for Google Cloud resources, such as external load balancers.
To get started, see Set up Cloud NAT with GKE.
For more information, see Use Cloud NAT for internet access from clusters.
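Cloud NAT needs a Cloud Router in the same region and network as the cluster. The following is a minimal sketch with automatic IP allocation; the names, region, and network are assumptions.

```hcl
resource "google_compute_router" "nat_router" {
  name    = "gke-nat-router"
  region  = "us-central1"
  network = "example-vpc" # assumed to exist
}

resource "google_compute_router_nat" "nat" {
  name   = "gke-nat"
  router = google_compute_router.nat_router.name
  region = google_compute_router.nat_router.region

  # Let Google allocate NAT IP addresses and cover every subnet range.
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}
```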
Configure Cloud DNS for GKE
Enable Cloud DNS for GKE to improve DNS scalability on your cluster.
For more information, see Use Cloud DNS for GKE.
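In Terraform, the DNS provider is selected with the cluster's `dns_config` block. The sketch below uses cluster scope and placeholder names.

```hcl
resource "google_container_cluster" "example" {
  name               = "example-cluster" # hypothetical name
  location           = "us-central1"
  initial_node_count = 1

  dns_config {
    cluster_dns        = "CLOUD_DNS"
    cluster_dns_scope  = "CLUSTER_SCOPE" # records resolve only inside the cluster
    cluster_dns_domain = "cluster.local"
  }
}
```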
Configure NodeLocal DNSCache
Improve DNS performance by enabling NodeLocal DNSCache.
For more information, see Enable NodeLocal DNSCache.
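NodeLocal DNSCache is a cluster add-on, so a Terraform sketch (with placeholder names) sets it under `addons_config`:

```hcl
resource "google_container_cluster" "example" {
  name               = "example-cluster" # hypothetical name
  location           = "us-central1"
  initial_node_count = 1

  addons_config {
    # Runs a DNS cache on every node to reduce lookup latency.
    dns_cache_config {
      enabled = true
    }
  }
}
```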
Create firewall rules
If you create firewall rules to block all access to GKE clusters or virtual machines, you must manually create firewall rules to allow GKE clusters to function.
For more information, see Automatically created firewall rules.
Multi-tenancy
Multi-tenancy allows Kubernetes cluster operators or administrators to design and operate clusters that are split across multiple tenants. In this case, tenant refers to an area of your organization, such as a team, business unit, or application.
Enable multi-tenancy
Enable multi-tenancy to create tenants using the Google Cloud console.
For more information, see Set up multi-tenancy.
Provide tenant-specific logs
Use the Cloud Logging Log Router to provide tenants with log data specific to project workloads.
For more information, see the following resources:
Create folders and projects
Use folders and projects to capture how your organization manages Google Cloud resources and to enforce a separation of concerns.
For more information, see Establish a folder and project hierarchy.
Configure access control
Assign the appropriate Identity and Access Management (IAM) roles to each group in your organization based on their scope of operations.
For more information, see Assign roles using IAM.
Enforce resource quotas
Enforce resource quotas to ensure all tenants that share a cluster have fair access to the cluster resources.
For more information, see Enforce resource quotas.
Isolate tenants using namespaces
Create namespaces to provide a logical isolation between tenants that are on the same cluster.
For more information, see Create namespaces.
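Namespace isolation and resource quotas can be sketched together with the Terraform `kubernetes` provider, which is assumed to be configured against the cluster. The tenant name and the quota values are illustrative only.

```hcl
resource "kubernetes_namespace" "team_a" {
  metadata {
    name = "team-a" # hypothetical tenant

    labels = {
      tenant = "team-a"
    }
  }
}

# Caps the total resources that the tenant's namespace can request.
resource "kubernetes_resource_quota" "team_a" {
  metadata {
    name      = "team-a-quota"
    namespace = kubernetes_namespace.team_a.metadata[0].name
  }

  spec {
    hard = {
      "requests.cpu"    = "8"
      "requests.memory" = "16Gi"
      pods              = "50"
    }
  }
}
```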
Monitoring
To keep your GKE clusters healthy, optimized, and scalable over time, you should collect and use system metrics.
Configure GKE alert policies
Get started with monitoring GKE by enabling the default GKE alert policies. To enable the alert policies, provide a notification channel.
For more information, see the following resources:
Enable Google Cloud Managed Service for Prometheus
Enable Google Cloud Managed Service for Prometheus to monitor and alert on your workloads. This step is not necessary for Autopilot clusters, because Autopilot automatically deploys Google Cloud Managed Service for Prometheus.
Google Cloud Managed Service for Prometheus lets you collect metrics from cluster nodes, DaemonSets, and the control plane. You can use these metrics for tasks such as the following:
- Tracking the number of objects in your cluster state database (etcd or Spanner) to predict when a cluster will reach a limit.
- Monitoring the number of API requests per second to prevent a workload from causing problems for the entire cluster.
For more information, see Get started with managed collection.
Configure control plane metrics
Collect Kubernetes control plane metrics using Cloud Monitoring.
For more information, see Use control plane metrics.
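On a Standard cluster, both managed collection and control plane metrics are configured in the `monitoring_config` block. This sketch uses placeholder names and enables the API server, scheduler, and controller manager components.

```hcl
resource "google_container_cluster" "example" {
  name               = "example-cluster" # hypothetical name
  location           = "us-central1"
  initial_node_count = 1

  monitoring_config {
    # System metrics plus the control plane components.
    enable_components = [
      "SYSTEM_COMPONENTS",
      "APISERVER",
      "SCHEDULER",
      "CONTROLLER_MANAGER",
    ]

    # Google Cloud Managed Service for Prometheus (managed collection).
    managed_prometheus {
      enabled = true
    }
  }
}
```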
Enable metrics packages
We recommend that you configure the following packages:
- Kube State Metrics
- Kubelet/cAdvisor
- For Standard clusters, Node Exporter
For more information, see Configuring rules and alerts using Terraform.
Maintenance
No production cluster can exist without planned and configured maintenance operations. This section shows you how to configure your cluster and workloads to allow for upgrades and maintenance operations so that you can apply security fixes without service interruption and so that application teams can adopt new Kubernetes features when needed.
Create environments
Use multiple environments to minimize risk and downtime. At minimum, create a production environment and a pre-production or test environment.
For more information, see Set up multiple environments.
Subscribe to Pub/Sub events
Use Pub/Sub to proactively receive updates about GKE upgrades.
For more information, see Receive updates about new GKE versions.
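A sketch of upgrade notifications follows: a Pub/Sub topic and a `notification_config` block filtered to upgrade events. The topic and cluster names are assumptions, and a subscription that consumes the messages is not shown.

```hcl
resource "google_pubsub_topic" "gke_upgrades" {
  name = "gke-upgrade-events" # hypothetical name
}

resource "google_container_cluster" "example" {
  name               = "example-cluster"
  location           = "us-central1"
  initial_node_count = 1

  notification_config {
    pubsub {
      enabled = true
      topic   = google_pubsub_topic.gke_upgrades.id

      # Deliver notices about available and in-progress upgrades.
      filter {
        event_type = ["UPGRADE_AVAILABLE_EVENT", "UPGRADE_EVENT"]
      }
    }
  }
}
```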
Enroll in release channels
Enroll your clusters in a release channel to keep them up to date with the latest GKE and Kubernetes updates.
For more information, see Enroll clusters in release channels.
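Enrollment is a single block on the cluster resource. The channel below is only an example; choose the channel that matches your risk tolerance.

```hcl
resource "google_container_cluster" "example" {
  name               = "example-cluster" # hypothetical name
  location           = "us-central1"
  initial_node_count = 1

  release_channel {
    channel = "REGULAR" # other values include RAPID and STABLE
  }
}
```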
Configure maintenance windows
Create maintenance windows and exclusions to increase upgrade predictability and to align upgrades with off-peak business hours.
Create a maintenance window and a maintenance exclusion by adding the following block to your Terraform configuration:
maintenance_policy {
  recurring_window {
    start_time = "WINDOW_START_TIME"
    end_time   = "WINDOW_END_TIME"
    recurrence = "RECURRENCE"
  }
  maintenance_exclusion {
    exclusion_name = "Batch job"
    start_time     = "EXCLUSION_START_TIME"
    end_time       = "EXCLUSION_END_TIME"
    exclusion_options {
      scope = "SCOPE"
    }
  }
}
Replace the following:
- WINDOW_START_TIME and WINDOW_END_TIME: When the recurring window should start and end. For example, 2022-01-01T00:00:00Z.
- RECURRENCE: How often the window recurs, specified as an RFC 5545 RRULE. For example, FREQ=DAILY. For more information, see the Terraform docs for google_container_cluster.
- EXCLUSION_START_TIME and EXCLUSION_END_TIME: When the maintenance exclusion should start and end. For example, 2022-01-01T00:00:00Z.
- SCOPE: The scope for the exclusion. For example, NO_UPGRADES.
For more information, see Schedule maintenance windows and exclusions.
Set Compute Engine quotas
Review your default Compute Engine quotas to ensure your GKE clusters have enough resources.
For more information, see Compute Engine quotas and best practices.
Configure cost controls
Use GKE cost allocation to view a breakdown of your GKE cluster costs.
To collect billing data, set up Cloud Billing data export to BigQuery.
For more information, see Best practices for running cost-optimized Kubernetes applications on GKE and Best practices for enterprise multi-tenancy.
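GKE cost allocation is enabled per cluster. A minimal Terraform sketch, with placeholder names and assuming that the billing export to BigQuery is set up separately:

```hcl
resource "google_container_cluster" "example" {
  name               = "example-cluster" # hypothetical name
  location           = "us-central1"
  initial_node_count = 1

  # Adds namespace and label breakdowns to the exported billing data.
  cost_management_config {
    enabled = true
  }
}
```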
Configure a billing budget
Create a Cloud Billing budget to monitor your Google Cloud charges and track your costs.
For more information, see Create, edit, or delete budgets and budget alerts.
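A budget with alert thresholds could be declared as in the following sketch. The billing account ID, amount, and thresholds are placeholder assumptions.

```hcl
resource "google_billing_budget" "gke" {
  billing_account = "000000-000000-000000" # hypothetical billing account ID
  display_name    = "GKE monthly budget"

  amount {
    specified_amount {
      currency_code = "USD"
      units         = "1000"
    }
  }

  # Alert at 50% and 90% of the budgeted amount.
  threshold_rules {
    threshold_percent = 0.5
  }

  threshold_rules {
    threshold_percent = 0.9
  }
}
```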
Summary of best practices
The following table summarizes the best practices recommended in this document: