Best practices for onboarding to GKE


This document is for cloud architects and network engineers who are onboarding to Google Kubernetes Engine (GKE).

For a summarized checklist of all the best practices, see Summary of best practices.

Preparation

This document provides a comprehensive list of tasks to complete and decisions to make when building an enterprise-ready system on GKE. The list provides recommendations and best practices for onboarding, uses infrastructure as code principles, and links to reference documentation.

Document your architecture

Diagram your architecture using the Google Cloud Developer Architecture tool.

For more information, see Introducing a Google Cloud architecture diagramming tool.

Configure landing zone properties

Decide on the security for your Google Cloud landing zone.

For more information, see Enterprise foundations blueprint.

Environment

This section lists features, tools, and configurations to consider when setting up the environment that will host your clusters.

Set up Terraform

Use Terraform and store your configuration files in a repository. Terraform is an infrastructure as code tool that lets you build, change, and version resources using a declarative configuration.

To get started, see Getting Started with the Google provider.

You can also use the Terraform examples in the Cloud Foundation Fabric GitHub repo.

For more information, see Best practices for using Terraform.

Store Terraform state

Use Cloud Storage to store your Terraform state. This lets you share information between different Terraform configurations by using remote state to reference other root modules.

For more information, see Store Terraform state in a Cloud Storage bucket.
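For example, a minimal backend configuration might look like the following sketch, where the bucket name and prefix are placeholders for your own values and the bucket must already exist:

terraform {
  backend "gcs" {
    bucket = "STATE_BUCKET_NAME"
    prefix = "terraform/state"
  }
}

Enabling object versioning on the bucket lets you recover an earlier state file if a write goes wrong.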

Create a metrics scope using Terraform

Create a monitored project and add all Google Cloud projects to it. This lets you view metrics data for groups of projects together in the Monitoring page of the Google Cloud console.
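With the Terraform google provider, you can add each project to the metrics scope of a scoping project by using a google_monitoring_monitored_project resource. The following sketch uses placeholder project IDs:

resource "google_monitoring_monitored_project" "monitored" {
  metrics_scope = "locations/global/metricsScopes/SCOPING_PROJECT_ID"
  name          = "MONITORED_PROJECT_ID"
}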

Set up Helm

Use the Helm package manager to define Kubernetes resources.

Set up Artifact Registry

Use Artifact Registry to store and manage your packages and Docker container images. Configure Artifact Registry to store artifacts from Cloud Build and deploy artifacts to GKE.

Consider using the same location for your Artifact Registry repository and your GKE cluster for the following reasons:

  • You aren't billed for network egress from Artifact Registry to GKE.
  • You get lower latency when pulling images from Artifact Registry.
  • You can use image streaming to pull container images.
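As a sketch, a Docker format repository can be created with Terraform as follows; the location and repository ID are placeholders:

resource "google_artifact_registry_repository" "containers" {
  location      = "REGION"
  repository_id = "REPOSITORY_NAME"
  description   = "Container images for GKE workloads"
  format        = "DOCKER"
}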

Set up Binary Authorization

Use Binary Authorization to provide software supply-chain security.

To get started, see Set up Binary Authorization for GKE.

Cluster config

When you create a cluster, you should choose cluster configuration options that meet your requirements. Several key features, such as cluster isolation mode, are immutable, meaning that you cannot change them without re-creating the cluster. Other features, such as node auto-provisioning, which helps you scale your cluster and sustain workloads, can be modified after cluster creation.

Choose a mode of operation

GKE has two modes of operation: the fully managed Autopilot mode and the more configurable Standard mode. We recommend using Autopilot for most production workloads unless you need granular control over your clusters.

For more information, see Autopilot overview.
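In Terraform, an Autopilot cluster can be declared by setting enable_autopilot on the google_container_cluster resource. The following is a minimal sketch; the name and region are placeholders:

resource "google_container_cluster" "autopilot_cluster" {
  name             = "CLUSTER_NAME"
  location         = "REGION"
  enable_autopilot = true
}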

Isolate your cluster

Use private clusters to minimize control plane exposure.

For more information, see Choose a private cluster type.
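As a sketch, private clusters are configured through the private_cluster_config block of a google_container_cluster resource; the CIDR range shown is a placeholder:

private_cluster_config {
  enable_private_nodes    = true
  enable_private_endpoint = true
  master_ipv4_cidr_block  = "172.16.0.0/28"
}

Setting enable_private_endpoint to true disables the public control plane endpoint; leave it false if you still need public access to the control plane.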

Configure Backup for GKE

Use Backup for GKE to back up and restore workloads in GKE clusters.

For more information, see Install Backup for GKE.

Use Container-Optimized OS node images

Use Container-Optimized OS node images, which are optimized to enhance node security.

For more information, see Available node images.

Enable node auto-provisioning

Enable node auto-provisioning to automatically manage and scale node pools.

Node auto-provisioning is enabled by default in Autopilot clusters.

For more information, see Enabling node auto-provisioning.
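In Standard clusters, node auto-provisioning can be enabled in Terraform through the cluster_autoscaling block of the google_container_cluster resource. The resource limits below are placeholders that you should size for your own workloads:

cluster_autoscaling {
  enabled = true
  resource_limits {
    resource_type = "cpu"
    minimum       = 4
    maximum       = 64
  }
  resource_limits {
    resource_type = "memory"
    minimum       = 16
    maximum       = 256
  }
}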

Separate your kube-system Pods

Separate your kube-system Pods from your workloads to prevent scale-down issues on underutilized nodes that run kube-system Pods.

For more information, see Separating kube-system pods from your workloads.

Security

How you configure your cluster security features impacts your ability to keep your cluster secure and reduce the potential attack surface. For example, the security posture dashboard provides continuous scanning for vulnerabilities and common misconfigurations in your workloads.

In addition to the steps in this section, review the GKE hardening guide and apply any additional security controls that are relevant to your organization's security needs.

Use the security posture dashboard

Use the security posture dashboard to automate detection and reporting of common security concerns across multiple clusters and workloads, with minimal intrusion and disruption to your running applications.

Security posture is enabled by default in Autopilot clusters in GKE versions 1.27 and later. For more information, see About the security posture dashboard.

Use group authentication

Use groups to manage your users.

For more information, see Google Groups for RBAC.
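As a sketch, group authentication is enabled through the authenticator_groups_config block of a google_container_cluster resource. Replace the domain with your own; note that the group itself must be named gke-security-groups:

authenticator_groups_config {
  security_group = "gke-security-groups@YOUR_DOMAIN"
}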

Use RBAC to restrict access to cluster resources

Assign the appropriate IAM roles for GKE to groups and users to provide permissions at the project level and use RBAC to grant permissions on a cluster and namespace level.

For more information, see Use namespaces and RBAC to restrict access to cluster resources.

Enable Shielded GKE Nodes

Enable Shielded GKE Nodes to increase the security of GKE nodes.

Shielded GKE Nodes are enabled by default in Autopilot clusters.

For more information, see Enable Shielded GKE Nodes.

Enable Workload Identity Federation for GKE

Authenticate to Google APIs using Workload Identity Federation for GKE.

Workload Identity Federation for GKE is enabled by default in Autopilot clusters. In Autopilot clusters, you can configure your applications to use Workload Identity Federation for GKE without needing to enable it first.

For more information, see Enable Workload Identity Federation for GKE.
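In Standard clusters, Workload Identity Federation for GKE can be enabled in Terraform through the workload_identity_config block of the google_container_cluster resource; PROJECT_ID is a placeholder:

workload_identity_config {
  workload_pool = "PROJECT_ID.svc.id.goog"
}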

Enable security bulletin notifications

Enable security bulletin notifications to receive notifications when security bulletins are available that are relevant to your cluster.

For more information, see Enable security bulletin notifications.

Use least privilege Google service accounts

Create a minimally privileged service account for your nodes to use instead of the Compute Engine default service account.

For more information, see Use least privilege Google service accounts.
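As a sketch, the following creates a minimal node service account and grants it only the logging and monitoring roles that nodes typically need; the account ID and project are placeholders:

resource "google_service_account" "gke_nodes" {
  account_id   = "gke-node-sa"
  display_name = "Least privilege GKE node service account"
}

resource "google_project_iam_member" "node_log_writer" {
  project = "PROJECT_ID"
  role    = "roles/logging.logWriter"
  member  = "serviceAccount:${google_service_account.gke_nodes.email}"
}

resource "google_project_iam_member" "node_metric_writer" {
  project = "PROJECT_ID"
  role    = "roles/monitoring.metricWriter"
  member  = "serviceAccount:${google_service_account.gke_nodes.email}"
}

Reference the account from the cluster's node_config block by setting service_account = google_service_account.gke_nodes.email.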

Restrict network access to the control plane and nodes

Limit the exposure of your cluster's control plane to the internet by using private nodes.

For more information, see Restrict network access to the control plane and nodes.

Restrict access to cluster API discovery

Configure authorized networks to restrict access to your cluster's discovery APIs.

For more information, see Restrict access to cluster API discovery.
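As a sketch, authorized networks are configured through the master_authorized_networks_config block of a google_container_cluster resource; the CIDR range and display name are placeholders:

master_authorized_networks_config {
  cidr_blocks {
    cidr_block   = "10.0.0.0/8"
    display_name = "internal-ranges"
  }
}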

Use namespaces to restrict access to cluster resources

Give teams least-privilege access to Kubernetes by creating separate namespaces or clusters for each team and environment.

For more information, see Use namespaces and RBAC to restrict access to cluster resources.

Networking

How you configure your networking environment in Google Cloud impacts your ability to scale, right-size, and troubleshoot your GKE clusters. For example, you must consider how many unique IP addresses you need before you create a cluster. Otherwise, you might run out of IP addresses when your cluster expands.

Create a custom mode VPC

Create a custom mode Virtual Private Cloud (VPC) to choose IP address ranges that won't overlap with other IP address ranges in your environment.

For more information, see Use custom subnet mode.
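As a sketch, the following creates a custom mode VPC and a subnet with secondary ranges for Pods and Services; all names and CIDR ranges are placeholders that you should plan against your existing address space:

resource "google_compute_network" "gke_network" {
  name                    = "NETWORK_NAME"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "gke_subnet" {
  name          = "SUBNET_NAME"
  region        = "REGION"
  network       = google_compute_network.gke_network.id
  ip_cidr_range = "10.0.0.0/20"

  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.4.0.0/14"
  }

  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.8.0.0/20"
  }
}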

Create a proxy-only subnet

Create a proxy-only subnet to expose Services with internal Application Load Balancers.

For more information, see Create a load balancer subnet.
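As a sketch, a proxy-only subnet is a regular subnet whose purpose is set to REGIONAL_MANAGED_PROXY; the name, network, and CIDR range are placeholders:

resource "google_compute_subnetwork" "proxy_only" {
  name          = "proxy-only-subnet"
  region        = "REGION"
  network       = "NETWORK_NAME"
  ip_cidr_range = "10.129.0.0/23"
  purpose       = "REGIONAL_MANAGED_PROXY"
  role          = "ACTIVE"
}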

Configure Shared VPC

Use Shared VPC to delegate responsibilities, such as creating and managing instances, to Service Project Admins while maintaining centralized control over network resources such as subnets, routes, and firewalls.

For more information, see Use Shared VPC networks.

Connect the cluster's VPC network to an on-premises network

Use Cloud VPN tunnels or Cloud Interconnect VLAN attachments to connect your cluster's VPC network to an on-premises network or other network from which you plan to administer your cluster. This lets you administer the cluster using tools like kubectl by connecting to the control plane private endpoint.

Make sure that the subnet IP address ranges in your cluster's VPC network don't conflict with the IP address ranges used in the network to which you connect.

Enable Cloud NAT

Enable Cloud NAT so that private clusters can connect to internet IP address destinations and external IP addresses for Google Cloud resources, such as external load balancers.

To get started, see Set up Cloud NAT with GKE.

For more information, see Use Cloud NAT for internet access from private clusters.
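As a sketch, Cloud NAT requires a Cloud Router in the same region as the subnets it serves; all names are placeholders:

resource "google_compute_router" "nat_router" {
  name    = "ROUTER_NAME"
  region  = "REGION"
  network = "NETWORK_NAME"
}

resource "google_compute_router_nat" "nat" {
  name                               = "NAT_NAME"
  router                             = google_compute_router.nat_router.name
  region                             = "REGION"
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}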

Configure Cloud DNS for GKE

Enable Cloud DNS for GKE to improve DNS scalability on your cluster.

For more information, see Use Cloud DNS for GKE.
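As a sketch, Cloud DNS for GKE is configured through the dns_config block of a google_container_cluster resource:

dns_config {
  cluster_dns       = "CLOUD_DNS"
  cluster_dns_scope = "CLUSTER_SCOPE"
}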

Configure NodeLocal DNSCache

Improve DNS performance by enabling NodeLocal DNSCache.

For more information, see Enable NodeLocal DNSCache.

Create firewall rules

If you create firewall rules to block all access to GKE clusters or virtual machines, you must manually create firewall rules to allow GKE clusters to function.

For more information, see Automatically created firewall rules.

Multi-tenancy

Multi-tenancy allows Kubernetes cluster operators or administrators to design and operate clusters that are split across multiple tenants. In this case, tenant refers to an area of your organization, such as a team, business unit, or application.

Enable multi-tenancy

Enable multi-tenancy to create tenants using the Google Cloud console.

For more information, see Set up multi-tenancy.

Provide tenant-specific logs

Use the Cloud Logging Log Router to provide tenants with log data specific to their project workloads.

For more information, see the following resources:

Create folders and projects

Use folders and projects to capture how your organization manages Google Cloud resources and to enforce a separation of concerns.

For more information, see Establish a folder and project hierarchy.

Configure access control

Assign the appropriate Identity and Access Management (IAM) roles to each group in your organization based on their scope of operations.

For more information, see Assign roles using IAM.

Enforce resource quotas

Enforce resource quotas to ensure all tenants that share a cluster have fair access to the cluster resources.

For more information, see Enforce resource quotas.

Isolate tenants using namespaces

Create namespaces to provide a logical isolation between tenants that are on the same cluster.

For more information, see Create namespaces.

Monitoring

To keep your GKE clusters healthy, optimized, and scalable over time, you should collect and use system metrics.

Configure GKE alert policies

Get started with monitoring GKE by enabling the default GKE alert policies. To enable an alert policy, provide a notification channel.

For more information, see the following resources:

Enable Google Cloud Managed Service for Prometheus

Enable Google Cloud Managed Service for Prometheus to monitor and alert on your workloads. This step is not necessary for Autopilot clusters, because Autopilot automatically deploys Google Cloud Managed Service for Prometheus.

Google Cloud Managed Service for Prometheus lets you collect metrics from cluster nodes, DaemonSets, and the control plane. You can use these metrics for tasks such as tracking the number of objects in etcd to predict when a cluster will reach a limit or monitoring the number of API requests per second to prevent a workload from causing problems for the entire cluster.

For more information, see Get started with managed collection.

Configure control plane metrics

Collect Kubernetes control plane metrics using Cloud Monitoring.

For more information, see Use control plane metrics.
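As a sketch, control plane metrics are enabled through the monitoring_config block of a google_container_cluster resource by listing the components to collect:

monitoring_config {
  enable_components = [
    "SYSTEM_COMPONENTS",
    "APISERVER",
    "SCHEDULER",
    "CONTROLLER_MANAGER",
  ]
}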

Enable metrics packages

We recommend that you configure the following packages:

For more information, see Configuring rules and alerts using Terraform.

Maintenance

No production cluster can exist without planned and configured maintenance operations. This section shows you how to configure your cluster and workloads to allow for upgrades and maintenance operations so that you can apply security fixes without service interruption and so that application teams can adopt new Kubernetes features when needed.

Create environments

Use multiple environments to minimize risk and downtime. At minimum, create a production environment and a pre-production or test environment.

For more information, see Set up multiple environments.

Subscribe to Pub/Sub events

Use Pub/Sub to proactively receive updates about GKE upgrades.

For more information, see Receive updates about new GKE versions.

Enroll in release channels

Enroll your cluster in a release channel to keep your clusters up-to-date with the latest GKE and Kubernetes updates.

For more information, see Enroll clusters in release channels.
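In Terraform, a cluster is enrolled in a channel through the release_channel block of the google_container_cluster resource; valid channels include RAPID, REGULAR, and STABLE:

release_channel {
  channel = "REGULAR"
}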

Configure maintenance windows

Create maintenance windows and exclusions to increase upgrade predictability and to align upgrades with off-peak business hours.

Create a maintenance window and a maintenance exclusion by adding the following block to your Terraform configuration:

maintenance_policy {
  recurring_window {
    start_time = "WINDOW_START_TIME"
    end_time   = "WINDOW_END_TIME"
    recurrence = "RECURRENCE"
  }
  maintenance_exclusion {
    exclusion_name = "Batch job"
    start_time     = "EXCLUSION_START_TIME"
    end_time       = "EXCLUSION_END_TIME"
    exclusion_options {
      scope = "SCOPE"
    }
  }
}

Replace the following:

  • WINDOW_START_TIME and WINDOW_END_TIME: When the recurring window should start and end. For example, 2022-01-01T00:00:00Z.
  • RECURRENCE: How often the window recurs, specified in RFC 5545 RRULE format. For example, FREQ=DAILY. For more information, see the Terraform docs for google_container_cluster.
  • EXCLUSION_START_TIME and EXCLUSION_END_TIME: When the maintenance exclusion should start and end. For example, 2022-01-01T00:00:00Z.
  • SCOPE: The scope for the exclusion. For example, NO_UPGRADES.

For more information, see Schedule maintenance windows and exclusions.

Set Compute Engine quotas

Review your default Compute Engine quotas to ensure your GKE clusters have enough resources.

For more information, see Compute Engine quotas and best practices.

Configure cost controls

Use GKE cost allocation to view a breakdown of your GKE cluster costs.

To collect billing data, set up Cloud Billing data export to BigQuery.

For more information, see Best practices for running cost-optimized Kubernetes applications on GKE and Best practices for enterprise multi-tenancy.

Configure a billing budget

Create a Cloud Billing budget to monitor your Google Cloud charges and track your costs.

For more information, see Create, edit, or delete budgets and budget alerts.

Summary of best practices

The best practices recommended in this document are grouped into the following onboarding areas:

  • Preparation
  • Environment
  • Cluster config
  • Security
  • Networking
  • Multi-tenancy
  • Monitoring
  • Maintenance

What's next