Google Kubernetes Engine reliability guide

Last reviewed 2023-07-19 UTC

Google Kubernetes Engine (GKE) is a system for operating containerized applications in the cloud, at scale. GKE deploys, manages, and provisions resources for your containerized applications. The GKE environment consists of Compute Engine instances grouped together to form a cluster.

Best practices

  • Best practices for operating containers - how to use logging mechanisms, ensure containers are stateless and immutable, monitor applications, and configure liveness and readiness probes.
  • Best practices for building containers - how to package a single application per container, handle process identifiers (PIDs), optimize for the Docker build cache, and build smaller images for faster upload and download times.
  • Best practices for Google Kubernetes Engine networking - use VPC-native clusters for easier scaling, plan IP addresses, scale cluster connectivity, use Google Cloud Armor to block Distributed Denial-of-Service (DDoS) attacks, implement container-native load balancing for lower latency, use the health check functionality of external Application Load Balancers for graceful failover, and use regional clusters to increase the availability of applications in a cluster.
  • Prepare cloud-based Kubernetes applications - learn the best practices to plan for application capacity, scale applications horizontally or vertically, set resource limits relative to resource requests for memory versus CPU, make containers lean for faster application startup, and limit Pod disruption by setting a Pod Disruption Budget (PDB). Also, understand how to set up liveness probes and readiness probes for graceful application startup, ensure non-disruptive shutdowns, and implement exponential backoff on retried requests to prevent traffic spikes that overwhelm your application.
  • GKE multi-tenancy best practices - how to design a multi-tenant cluster architecture for high availability and reliability, use GKE usage metering for per-tenant usage metrics, provide tenant-specific logs, and provide tenant-specific monitoring.
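
One item in the list above, exponential backoff on retried requests, is easier to picture with a small example. The following is a minimal client-side sketch in Python, not code taken from any of the linked guides. The function names (backoff_delays, call_with_backoff), the default values, and the choice of "full jitter" (sleeping for a random time between zero and the current ceiling) are assumptions made for illustration only.

```python
import random
import time


def backoff_delays(max_retries, base=0.5, cap=30.0, rng=random.random):
    """Yield one sleep duration (in seconds) per retry.

    Hypothetical helper: capped exponential backoff with full jitter.
    """
    for attempt in range(max_retries):
        # The ceiling doubles on every attempt but never exceeds `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: scale the ceiling by a random factor so that many
        # clients retrying at once do not all hit the service together.
        yield rng() * ceiling


def call_with_backoff(operation, max_retries=5, base=0.5, cap=30.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep):
    """Call operation(); on a retryable error, wait and try again.

    Gives up and re-raises the last error after `max_retries` retries.
    """
    delays = backoff_delays(max_retries, base, cap)
    while True:
        try:
            return operation()
        except retry_on:
            delay = next(delays, None)
            if delay is None:
                raise  # Retries exhausted: surface the last error.
            sleep(delay)
```

A caller would wrap a request in a zero-argument function, for example call_with_backoff(lambda: fetch_orders()), where fetch_orders is a placeholder for your own request code. Which errors count as retryable depends on the client library in use, so the retry_on default here is only a stand-in.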