Best practices for running batch workloads on GKE

Autopilot Standard

This page introduces the best practices to build batch processing platforms with Google Kubernetes Engine (GKE). GKE provides a powerful framework for orchestrating batch workloads such as data processing, training machine learning models, running scientific simulations, and other high performance computing workloads.

These best practices are intended for platform administrators, cloud architects, and operations professionals interested in deploying batch workloads in GKE.

How batch workloads work

A batch workload is a group of tasks that run to completion without user intervention. To define tasks, you use the Kubernetes Jobs resource. A batch platform receives the Jobs and queues them in the order they are received. The queue in the batch platform applies processing logic such as priority, quota, and allocable resources. By queueing and customizing the batch processing parameters, Kubernetes lets you optimize the use of available resources, minimize the idle time for scheduled Jobs, and maximize cost savings. The following diagram shows the GKE components that can be part of a batch platform.

Batch platform management

Traditionally, batch platforms have two main user personas, developers and platform administrators:

A developer submits a Job specifying the program, the data to be processed, and requirements for the Job. Then, the developer receives confirmation of the Job submission and a unique identifier. Once the Job is complete, the developer would get a notification along with any output or results of the Job.
A platform administrator manages and delivers an efficient and reliable batch processing platform to the developers.

A batch processing platform must meet the following requirements:

The platform resources are properly provisioned to ensure that Jobs run with little to no user intervention required.
The platform resources are configured according to the organization's security and observability best practices.
The platform resources are used as efficiently as possible. In case of resource contention, the most important work gets done first.

Prepare the batch platform architecture in GKE

A GKE environment consists of nodes, which are Compute Engine virtual machines (VMs), that are grouped together to form a cluster.

The following table lists the key recommendations when planning and designing your batch platform architecture:

Recommendation	Resources
Select a GKE mode of operation	GKE has the following modes of operation available: With Autopilot mode, GKE automatically manages your cluster configuration, including your nodes, scaling, security, and other preconfigured settings so you can focus on your workload. Autopilot clusters are highly-available by default. With Standard mode, you define and manage your cluster configuration, including your nodes, scaling, security, and other advanced settings. See the high-level comparison between Autopilot and Standard mode.
Choose the machine type for your nodes	GKE supports the following Compute Engine VMs series: Cost-optimized, such as E2 Balanced, such as N2, N2D, or N1 Scale-out optimized, such as Tau T2D, or Tau T2A Memory-optimized, such as M2 or M1 Compute optimized, such as C2 or C2D Accelerator optimized, such as A2 featuring NVIDIA A100 GPUs, G2 featuring NVIDIA L4 GPUs, A3 featuring NVIDIA H100 GPUs (available in Private preview). Each machine series is associated with one or more CPU platforms, such as Arm processors and x86 processors from Intel and AMD. Learn about the options that are currently available for your workload.
Use hardware accelerators for your nodes	You can also use hardware accelerators such as graphics processing units (GPUs) and Tensor Processing Units (TPUs) in GKE. Consider is GPU time-sharing strategy, which allows multiple containers to share time on the same physical GPU. This approach is useful for burstable and homogenous GPU workloads with low requests. Multi-instance GPUs to partition GPUs to share a single GPU resource across multiple containers at the same time. Learn about the available GPU accelerator strategies, such as Time-sharing and Multi-instance GPUs. Learn about TPUs accelerator strategies.
Enable cluster autoscaler on Standard clusters	GKE automatically resizes the number of nodes in a given node pool based on the demands of your workloads. You don't need to manually add or remove nodes or over-provision your node pools. Instead, you only specify a minimum and maximum size for the node pool. We recommend you set cluster autoscaler with the following configuration: Use `optimize-utilization` profile that removes unused nodes up to three times faster than the balanced profile. To learn more, see Autoscaling profiles. Set location policy to `ANY`. The GKE cluster autoscaler prioritizes utilization of unused reservations and create nodes in any available zone in the regions. To learn more, see Location policy. Enable node auto-provisioning to automatically manage and autoscale your infrastructure. Once a node pool is created using auto-provisioning, cluster autoscaler can dynamically scale the nodepool. To learn more, see How node auto-provisioning works. With Autopilot clusters, you don't need to worry about provisioning nodes or managing node pools because node pools are automatically provisioned through node auto-provisioning, and are automatically scaled to meet the requirements of your workloads.
Enroll your cluster in a release channel	GKE can automatically manage your cluster version and upgrades. Based on your release adoption model, you can enroll your cluster in the GKE available channels. To learn more, see How to choose the best release channel for your clusters
Define a scope of maintenance to exclude for your cluster	With a upgrade scope exclusion window defined, GKE respects that long running batch workloads aren't disrupted for maintenance until completion. To learn more, see Scope of maintenance to exclude.

Manage the Job lifecycle

In Kubernetes, you run your workloads in a set of Pods. Pods are groups of single or multiple containers, with shared storage and network resources. Pods are defined by a Kubernetes specification.

A Job creates one or more Pods and continues to retry execution of the them until a specified number of Pods successfully terminate. As Pods complete, the Job tracks the successful completions. When a specified number of successful completions is reached, the Job is complete.

The following table lists the key recommendations when designing and managing Jobs:

Recommendation	Resources
Select the Job completion mode	Specify the Completion mode as `Indexed`. This configuration is useful when assigning a partition of the data to be processed based on the Pod's index. The Pods of a Job get an associated completion index. Deleting a Job will clean up the Pods it created. Suspending a Job will delete its active Pods until the Job is resumed again.
Set CronJobs for regular scheduled actions	Use CronJob for GKE to perform regular scheduled actions, such as backups, report generation, or scheduled training for machine learning models.
Manage failures in a Job	Define the Kubernetes Pod failure policy and Pod backoff failure limit to handle retriable and non-retriable failures in a Job. This definition improves the cluster resources consumption by avoiding unnecessary Pod retries and Job failures due to Pod disruptions. For example, you can configure preemption, API-initiated eviction, or taint-based eviction where Pods that don't have a toleration for the NoExecute taint effect are evicted. Learn how to Handle retriable and non-retriable pod failures with Pod failure policy.
Manage multiple Jobs as a unit	Use the JobSet API to manage multiple Jobs as a unit to address workload patterns, such as one driver (or coordinator) and multiple workers (for example, `MPIJob`) while setting Job defaults that are aligned with common patterns based on your use cases. For example, you can create an Indexed Job by default, create a headless service for predictable fully qualified domain names (FQDN) for Pods, and set the associated Pod failure policy.
Extend run time for a Pod that doesn't tolerate restarts	Set the Kubernetes `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation to `false` in the Pod specification. The cluster autoscaler respects the eviction rules set on Pods. These restrictions can prevent a node from being deleted by the autoscaler if it contains a Pod with the `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation. To learn more, see Considering Pod scheduling and disruption.

Manage Multi-tenancy

GKE cluster multi-tenancy is an alternative to managing GKE resources by different users or workloads, named as tenants, in a single organization. The management of GKE resources might follow criteria such as tenant isolation, quotas and limit ranges, or cost allocation.

The following table lists the key recommendations when managing multi-tenancy:

Recommendation	Resources
Use namespaces to manage tenant isolation	You can separate each tenant and their Kubernetes resources into their own namespaces.
Use policies to enforce tenant isolation	Define Policies to restrict API access, set quotas, constrain resource usage, and restrict what containers are allowed to do. These policies are scoped by namespaces.
Set GKE cost allocation	Use GKE cost allocation get insights about cluster resource requests for each tenant on a namespace basis.

Control access to the batch platform

GKE allows you to finely tune the access permissions of the workloads running on the cluster.

The following table lists the key recommendations when managing access and security

Recommendation	Resources
Set Workload Identity Federation for GKE	GKE allows workloads in your GKE cluster to impersonate Identity and Access Management (IAM) service accounts to access Google Cloud services. By using Workload Identity Federation for GKE, workloads can securely access secrets stored outside GKE. To learn more, see Workload Identity Federation for GKE and access secrets stored.
Set cluster network isolation	Use private clusters where both the control plane endpoint and worker nodes can have internal IP addresses. You can also change the cluster isolation for existing public clusters that use Private Service Connect. To learn more, see public clusters and changing cluster isolation.
Use Shielded GKE Nodes	Configure Shielded GKE Nodes to provide strong, verifiable node identity and integrity to increase the security of GKE nodes.
Physical isolation	For security reasons your workloads might need stronger isolation. Control scheduling with node taints to physically separate tenants in node-pools using node-taints and workload tolerations. This ensures that only the appropriate workloads to be scheduled on those node-pools.

Queueing and fair sharing

To control resource consumption, you can assign resource quota limits for each tenant, queue incoming Jobs, and process Jobs in the order they were received.

The following table lists the key recommendations when managing queueing and fair sharing among batch workloads:

Recommendation	Resources
Use Kueue	Kueue is a kubernetes-native Job queueing system for batch, high performance computing, machine learning, and similar applications in a Kubernetes cluster. To help with fair sharing of cluster resources between its tenants, Kueue manages quotas and how Jobs consume them. Kueue makes the following decisions: When a Job should wait When a Job should be admitted to start, for example by creating the Pod When a Job should be preempted, for example by deleting the Pod To learn how to implement a Job queueing system, see Implement a Job queuing system with quota sharing between namespaces on GKE. To learn more about Kueue, see Kueue concepts.

Recommendation

Resources

Use Kueue

Kueue is a kubernetes-native Job queueing system for batch, high performance computing, machine learning, and similar applications in a Kubernetes cluster. To help with fair sharing of cluster resources between its tenants, Kueue manages quotas and how Jobs consume them. Kueue makes the following decisions:

When a Job should wait
When a Job should be admitted to start, for example by creating the Pod
When a Job should be preempted, for example by deleting the Pod

To learn how to implement a Job queueing system, see Implement a Job queuing system with quota sharing between namespaces on GKE.

To learn more about Kueue, see Kueue concepts.

Storage, performance, and cost efficiency

The efficient use of our GKE compute and storage resources can reduce costs. One strategy is to right-size and configure your compute instances to align with your batch processing needs while not sacrificing performance.

The following table lists the key recommendations when designing and managing storage and optimizing performance:

Recommendation	Resources
Use Compute Engine Persistent Disks	We recommend you use the following Compute Engine Persistent Disks configurations: Use Regional persistent disks to allow worker nodes in all zones in a region to access the same underlying durable storage. For workloads that require high-performance storage, use Local SSDs for ephemeral data, or Extreme Persistent Disk for persistent data. Use Hyperdisk Throughput Persistent Disk to independently and dynamically provision throughput and capacity optimized for batch workloads.
Use network-attached storage	Use the following network-attached storage alongside Persistent Disk for optimal storage performance: Use Filestore to allow all worker nodes in a Pod to access the same storage namespace and scale capacity. Use Cloud Storage FUSE to access Cloud Storage directly from a container as a local POSIX mount.
Define Pub/Sub	Your batch workload may also read and write data. For example, you can use Pub/Sub and write the results to a data warehouse such as BigQuery from where reports and dashboards are updated. We recommend you use the following storage solutions: For managed object storage, use Cloud Storage. For managed network file storage, use Filestore. For workloads that require file-system semantics, use Cloud Storage FUSE CSI driver. This driver allows Kubernetes applications to mount Cloud Storage buckets as local file systems
Specify tuning parameters for your workload	We recommend you use the following configurations: Customize your node system configuration for your workload. For example, set the minimum, default, and maximum values of the TCP socket receive buffer. Use node system configuration. Enable busy polling using network profiles. Some workloads that are sensitive to network latency might be improved. Use network profiles. Increase network bandwidth for GKE nodes by enabling Google Virtual NIC (gVNIC), a virtual network interface designed specifically for Compute Engine recommended for high performance applications. Specify the number of threads per core to disable Simultaneous multi-threading.
Optimize your workloads networking and latency	GKE supports compact placement policy for node pools which specifies that these nodes (and thus the workloads running on them) should be placed in closer physical proximity to each other within a zone. This is especially useful for tightly coupled and high performance workloads where low latency between different processes comprising the workload is a major concern. To learn more, see compact placement.
Use Spot VMs	Spot VMs are Compute Engine virtual machine (VM) instances that are priced lower than standard Compute Engine VMs and provide no guarantee of availability. We recommend you use the following solutions: Set autoscaling Spot VMs node pools combined with `location_policy= "ANY"` . With this policy, Spot VMs have a lower risk of being preempted. This combination is especially useful for workloads that can survive the preemption of individual worker nodes, such as pleasingly parallel computations. If your workload has a predictable resource footprint, combine Google Cloud Reservations and committed use discounts may yield significant savings. Create a node pool with its size set to the number of reserved instances and prioritize saturating this node pool for maximum usage.
Use image steaming	Use Image streaming to pull container images. GKE streams data from eligible images. This allows your workloads to initialize without waiting for the entire image to download, leading to significantly improved initialization times and better cost efficiency.

Monitoring

GKE is integrated with observability and logging tools that help you monitor the reliability and efficiency of your cluster. The following table lists the key recommendations when enabling and using GKE observability tools:

Recommendation	Resources
Use Prometheus	GKE is integrated with Google Cloud Observability. Customize the metrics that you want GKE to send to Cloud Logging and Cloud Monitoring Google Cloud Managed Service for Prometheus is enabled for GKE clusters by default. We recommend that you use managed collection to eliminate the complexity of setting up and maintaining Prometheus servers. To learn more, see Managed Service for Prometheus.
Use Cloud Monitoring dashboards	Use the Monitoring dashboards for GKE to see a high level overview of cluster and resource utilization and drill down and filter through various metrics and dimensions. To learn more, see Observing your GKE clusters.

Recommendation

Resources

Use Prometheus

GKE is integrated with Google Cloud Observability. Customize the metrics that you want GKE to send to Cloud Logging and Cloud Monitoring

Google Cloud Managed Service for Prometheus is enabled for GKE clusters by default. We recommend that you use managed collection to eliminate the complexity of setting up and maintaining Prometheus servers.

To learn more, see Managed Service for Prometheus.

Use Cloud Monitoring dashboards

Use the Monitoring dashboards for GKE to see a high level overview of cluster and resource utilization and drill down and filter through various metrics and dimensions.

To learn more, see Observing your GKE clusters.

What's next

Learn how to Deploy a batch system using Kueue
See the Best practices for running cost-optimized Kubernetes applications on GKE