Google Kubernetes Engine (GKE) Pod snapshots reduce workload startup latency by restoring snapshots of running Pods. A Pod snapshot saves the entire Pod state, including memory and file system changes. When you create new replicas, they're restored from the snapshot, allowing the workload to resume instead of starting from a fresh state.
This document provides a conceptual overview of GKE Pod snapshots. To learn how to enable and use this feature, see Restore from a Pod snapshot.
When to use Pod snapshots
Use Pod snapshots for workloads that have long initialization times, for example, AI inference workloads that load large models into CPU or GPU memory, or large applications that load many libraries and dependencies. Workloads that already have fast startup times generally won't benefit from Pod snapshots.
How Pod snapshots work
GKE Pod snapshots store an exact copy of a Pod's process state at a specific point in time. When new replicas are created, instead of initializing the Pod from a fresh state, the Pod is restored from a snapshot, resuming execution from the point the snapshot was taken.
To use Pod snapshots, you create Kubernetes custom resource definitions (CRDs) to declaratively configure snapshot behavior. An agent running on each GKE node manages the snapshot lifecycle. Based on the policies that you define, the agent determines when to create new snapshots and when to use existing snapshots to restore new Pods. A controller running on the GKE control plane cleans up obsolete snapshots and resolves issues. Cloud Storage stores your Pod snapshots.
Custom resource definitions
Pod snapshots are configured declaratively with two CRDs:
- PodSnapshotStorageConfig: specifies the storage location for snapshots. Only Cloud Storage buckets are supported.
- PodSnapshotPolicy: defines which Pods to snapshot based on Kubernetes label selectors. This resource contains the majority of configuration options for the feature, including snapshot triggers and retention policies.
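The following sketch shows how these two resources might fit together. The API group, version, and field names are assumptions for illustration, not the published schema; see Restore from a Pod snapshot for the exact reference:

```yaml
# Illustrative sketch only: the API group, version, and spec fields below
# are assumptions, not the published schema.
apiVersion: podsnapshot.gke.io/v1   # hypothetical group and version
kind: PodSnapshotStorageConfig
metadata:
  name: snapshot-storage
spec:
  bucket: my-snapshot-bucket        # hypothetical field; only Cloud Storage buckets are supported
---
apiVersion: podsnapshot.gke.io/v1   # hypothetical group and version
kind: PodSnapshotPolicy
metadata:
  name: inference-snapshots
spec:
  selector:                         # standard Kubernetes label selector
    matchLabels:
      app: llm-inference
```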
Snapshot matching
Pod matching determines if a Pod snapshot is compatible with a specific Pod. This match is achieved by creating a unique hash from the Pod's essential runtime specifications, also called the distilled Pod spec. This hash is then embedded within the Pod snapshot. For a later Pod to be restored from this Pod snapshot, it must generate an identical hash from its own distilled Pod spec. This process helps ensure that the checkpointed and restored Pods are identical in their runtime configurations.
Distillation simplifies the Pod specification by retaining only the critical
runtime fields, such as image, while removing non-essential fields like
nodeName or nodeSelector. You must ensure that the values of these essential
fields are consistent between the Pod used for checkpointing and the Pod
intended for restoration.
The following fields from the Pod object influence the unique hash:
- metadata:
  - annotations: only annotations that are relevant to the gVisor runtime, such as annotations that start with the dev.gvisor.* prefix.
  - labels: batch.kubernetes.io/job-completion-index
- spec:
  - volumes: name, volumeSource, hostPath, persistentVolumeClaim, configMap
  - containers:
    - name
    - image
    - command
    - args
    - workingDir
    - ports: name, containerPort, protocol
    - volumeMounts: name, readOnly, recursiveReadOnly, mountPath, subPath, mountPropagation, subPathExpr
    - volumeDevices: name
    - lifecycle: postStart, preStop
    - terminationMessagePath
    - terminationMessagePolicy
    - securityContext (and all sub-fields)
    - stdin
    - stdinOnce
    - tty
  - initContainers: same sub-fields as containers.
  - dnsPolicy
  - automountServiceAccountToken
  - hostNetwork
  - hostPID
  - hostIPC
  - shareProcessNamespace
  - securityContext
  - dnsConfig
  - runtimeClassName
  - os
  - hostUsers
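For illustration, the two Pods in the following sketch are compatible: they differ only in nodeSelector, which distillation removes, so both produce the same distilled hash. Changing a hashed field, such as image, on either Pod would break the match. The image path is a placeholder:

```yaml
# Both Pods distill to the same hash: nodeSelector is not part of the
# distilled Pod spec, but name, image, and runtimeClassName are.
apiVersion: v1
kind: Pod
metadata:
  name: checkpoint-pod
spec:
  runtimeClassName: gvisor   # part of the hash; must match
  containers:
  - name: server             # part of the hash; must match
    image: us-docker.pkg.dev/my-project/repo/model-server:v1   # part of the hash; must match
---
apiVersion: v1
kind: Pod
metadata:
  name: restore-pod
spec:
  nodeSelector:              # removed by distillation; doesn't affect matching
    topology.kubernetes.io/zone: us-central1-a
  runtimeClassName: gvisor
  containers:
  - name: server
    image: us-docker.pkg.dev/my-project/repo/model-server:v1
```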
In addition to the hash, the following criteria must match for a snapshot to be considered compatible:
- Hardware: the new Pod must run on a node that has the same machine series and architecture as the original Pod. The number of CPUs and the amount of memory can change. E2 machine types are not supported because of their dynamic underlying architecture.
- Versioning: the gVisor kernel version and the GPU driver version must match.
GKE manages the snapshot compatibility. If GKE finds a compatible snapshot, GKE restores the new Pod from the snapshot. If no compatible snapshot exists, the Pod starts normally.
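For example, you can pin replicas to one machine series so that restored Pods always land on compatible hardware. The following sketch assumes the cloud.google.com/machine-family node label:

```yaml
# Pod-spec sketch: pin replicas to one machine series. Because nodeSelector
# is excluded from the distilled spec, adding it doesn't change the
# snapshot hash.
spec:
  nodeSelector:
    cloud.google.com/machine-family: g2   # same series as the original Pod; E2 is unsupported
```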
Restore readiness and background loading
When a Pod is restored from a snapshot, the gVisor kernel is restored first, which typically takes a few seconds. To minimize startup latency, the application resumes immediately after the kernel is restored. It does not wait for the application memory to be fully loaded. Application memory is restored by using a background streaming mechanism.
If the application tries to access a part of memory that hasn't been loaded yet, a page fault occurs. gVisor intercepts this fault, pauses the application thread, and immediately fetches the required memory page from storage. This on-demand fetching is prioritized over the background stream.
Because of this background loading, memory access might have a small amount of latency for the first few seconds after a restore if the application needs memory that hasn't been streamed yet. This latency disappears when the memory state is fully synchronized.
This background loading behavior also applies to the GPU state. For example, a
large language model (LLM) Pod might appear to be in the Running state and
respond to network checks even though its GPU memory is still being populated.
The model won't be fully responsive for inference until the GPU state is
completely restored. Because of this, when measuring restoration speed, ensure
that you capture when the model server has started. You can check when the model
server starts by using
metrics like Time-to-First-Token (TTFT) or Pod readiness probes.
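For example, a standard Kubernetes readiness probe keeps traffic away from the restored Pod until the model server actually responds. This is a sketch; the endpoint path and port are assumptions:

```yaml
# Container-level sketch: gate traffic on the model server itself rather
# than on the Pod phase. The /health path and port 8080 are assumptions;
# use the endpoint that your model server exposes.
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 2       # poll frequently; background loading settles in seconds
  failureThreshold: 60   # tolerate the window while GPU state is still streaming
```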
GPU state
Pod snapshots support capturing the state of GPUs. When you trigger a snapshot
for a Pod that uses GPUs, the NVIDIA cuda-checkpoint tool saves the GPU state
into process memory. This means that any data stored on the GPU, for example model weights, is included in the snapshot. The Pod is then paused and snapshotted. During restore, the process is reversed.
Because GPU state is written into process memory, Pod memory usage increases during snapshot and restore operations. You should account for this additional memory requirement when you set memory limits for your Pods.
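For example, you might set the limit to steady-state usage plus headroom that covers the staged GPU state, as in the following sketch. The figures are illustrative assumptions:

```yaml
# Container-level sketch: leave headroom above steady-state memory usage so
# the snapshot and restore paths can stage GPU state in process memory.
# Size the headroom to your GPU memory footprint (for example, model weights).
resources:
  requests:
    memory: 60Gi
  limits:
    memory: 60Gi         # steady-state usage plus headroom for staged GPU state
    nvidia.com/gpu: 1
```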
Considerations for restored Pods
From the perspective of the Kubernetes API, a new Pod is created. If a compatible snapshot exists when the Pod starts, the Pod is restored from that snapshot, including the original memory and process state. However, some aspects of the Pod's state must change for it to function as a new, unique instance.
Consider the following state changes after a restore:
- Application state: application state that must be unique for each Pod, such as experiment IDs or random number seeds, must be re-initialized after a restore.
- Secrets: encryption keys and certificates created before the snapshot is taken must be re-created.
- Environment variables: you can change environment variables between a snapshot and a restore. However, because environment variables are stored in application memory, GKE Sandbox can't reliably find and replace them. If your workload relies on new environment variables after a restore, the Pod must manually refresh them. The new environment variables are available in the /proc/gvisor/spec_environ file. The file format is the same as /proc/<pid>/environ.
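Because the file uses NUL separators, you can inspect it from a shell inside the container, for example:

```shell
# Print the refreshed environment as one KEY=VALUE pair per line by
# converting the NUL separators to newlines.
tr '\0' '\n' < /proc/gvisor/spec_environ
```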
State that changes after restore
Not all state is retained upon restore. The following parts of the Pod state change so that the Pod can assume a new identity:
- Network interfaces: the restored Pod receives a new IP address. All interfaces and routes are reconfigured. Active network connections that existed at the time of the snapshot are closed upon restore. Listening sockets, loopback connections, and Unix domain socket connections continue to function.
- Hostname: the restored Pod assumes a new identity and receives a new hostname.
- Wall-clock time: the wall-clock time jumps ahead to the current time.
Limitations and requirements
GKE Pod snapshots have the following limitations:
- Pods must run in GKE Sandbox because Pod snapshots depend on the gVisor container runtime that GKE Sandbox provides. See the sketch after this list.
- Pod snapshots don't support E2 machine types.
- Pod snapshots support Pods that use GPUs only on the following machine types and GPU configurations:
  - g2-standard-4 (1 x L4)
  - g2-standard-8 (1 x L4)
  - g2-standard-12 (1 x L4)
  - g2-standard-16 (1 x L4)
  - g2-standard-32 (1 x L4)
  - g2-standard-48 (4 x L4)
  - g2-standard-96 (8 x L4)
  - a2-highgpu-1g (1 x A100-40GB)
  - a2-ultragpu-1g (1 x A100-80GB)
  - a3-highgpu-1g (1 x H100-80GB)
- Partial GPU usage is not supported. If a node has multiple GPUs, a Pod must use all of them. For example, you can't use Pod snapshots with four Pods that each use one GPU on a four-GPU machine.
- Using the Cloud Storage FUSE CSI driver sidecar container with Pod snapshots is not supported.
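As noted in the first limitation, a Pod opts into GKE Sandbox by selecting the gvisor RuntimeClass, as in the following minimal sketch (the image path is a placeholder):

```yaml
# Minimal sketch: run the Pod in GKE Sandbox by selecting the gvisor
# RuntimeClass, which is required for Pod snapshots.
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-pod
spec:
  runtimeClassName: gvisor
  containers:
  - name: app
    image: us-docker.pkg.dev/my-project/repo/app:v1
```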
What's next
- To learn how to use Pod snapshots, see Restore from a Pod snapshot.