Configure PDB violation timeout

This document shows how to configure a timeout for cases where draining a cluster node would violate a Pod disruption budget (PDB).

When a node is drained, all Pods on the node must be terminated. By default, if the termination of a Pod is in violation of a PDB, the draining of the node is blocked.

In some situations, you might want to configure a maximum time that the draining of a node can be blocked by a PDB violation. For example, you might want to configure a timeout value before you start a cluster update or upgrade. Or you might need to configure a timeout value for a node that is currently blocked from draining by a PDB violation.

Set a timeout value

Each node is represented by a Machine object.

List the Machine objects in the cluster:

kubectl --kubeconfig CLUSTER_KUBECONFIG get machines

Replace CLUSTER_KUBECONFIG with the path of the cluster kubeconfig file.

Example output:

my-node-pool-7f864959cd-cw472
my-node-pool-7f864959cd-kh86m
my-node-pool-7f864959cd-wtpvx

Open a Machine object for editing:

kubectl --kubeconfig CLUSTER_KUBECONFIG edit machine MACHINE_NAME

Replace MACHINE_NAME with the name of the Machine object.

In the editor, add this annotation:

onprem.cluster.gke.io/pdb-violation-timeout: TIMEOUT

Replace TIMEOUT with a string that specifies the duration of the timeout. Valid time units are "s" (seconds), "m" (minutes), and "h" (hours). Examples of time values are "1h", "1h30m", "10m", and "100s".

If you set the timeout value to "0s", then PDB violations will never time out. This is the same as the default behavior.

Example:

apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  annotations:
    kubelet-version: 1.23.5-gke.1502
    onprem.cluster.gke.io/gke-on-prem-version: 1.12.0-gke.430
    vm-ip-address: 203.0.113.2
    onprem.cluster.gke.io/pdb-violation-timeout: "5m"

Close the editing session.
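As an alternative to the interactive edit, the same annotation can probably be set with kubectl annotate. The following is a minimal sketch, not an official procedure. The function names (is_valid_timeout, set_pdb_timeout, get_pdb_timeout) are hypothetical helpers introduced for illustration. The format check assumes whole-number values in h, m, s order, which may be stricter than what the cluster actually accepts.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; these names are not part of any tool.

# Returns 0 if $1 looks like a duration made of whole-number h, m, s parts
# in that order (for example 1h30m). ASSUMPTION: this pattern is a
# conservative guess and may reject strings the cluster would accept.
is_valid_timeout() {
  [ -n "$1" ] && printf '%s\n' "$1" | grep -Eq '^([0-9]+h)?([0-9]+m)?([0-9]+s)?$'
}

# Sets the PDB violation timeout annotation on one Machine without an editor.
# Arguments: kubeconfig path, Machine name, timeout string.
set_pdb_timeout() {
  if ! is_valid_timeout "$3"; then
    echo "invalid timeout: $3" >&2
    return 1
  fi
  kubectl --kubeconfig "$1" annotate machine "$2" \
    "onprem.cluster.gke.io/pdb-violation-timeout=$3" --overwrite
}

# Prints the current annotation value for one Machine (empty if unset).
# Dots inside the annotation key are escaped for kubectl's JSONPath.
get_pdb_timeout() {
  kubectl --kubeconfig "$1" get machine "$2" \
    -o 'jsonpath={.metadata.annotations.onprem\.cluster\.gke\.io/pdb-violation-timeout}'
}
```

For example, set_pdb_timeout CLUSTER_KUBECONFIG MACHINE_NAME 5m would set a five-minute timeout, and get_pdb_timeout CLUSTER_KUBECONFIG MACHINE_NAME can be used afterward to confirm the value. The --overwrite flag lets the command replace an annotation that already exists.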

Rolling updates

During a rolling update, a new surge machine is created first. Then the old node is drained, and after all the Pods on it have been evicted, the old Machine object and Node object are both deleted. The PDB violation timeout annotation does not persist on the newly created Machine object.
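Because the annotation does not carry over, you might want to apply it again to the Machine objects that exist after an update. The following is a minimal sketch under stated assumptions: annotate_all_machines is a hypothetical helper name, the Machine objects are assumed to be in the namespace that the kubeconfig selects by default, and applying one timeout to every Machine is assumed to be what you want.

```shell
#!/bin/sh
# Hypothetical helper for illustration: applies the same PDB violation
# timeout to every Machine object returned by "get machines".
# Arguments: kubeconfig path, timeout string (for example 10m).
annotate_all_machines() {
  kubectl --kubeconfig "$1" get machines -o name |
  while IFS= read -r machine; do
    # stdin is redirected so the inner command cannot consume the name list.
    kubectl --kubeconfig "$1" annotate "$machine" \
      "onprem.cluster.gke.io/pdb-violation-timeout=$2" --overwrite </dev/null
  done
}
```

For example, annotate_all_machines CLUSTER_KUBECONFIG 10m would set a ten-minute timeout on each listed Machine. To target only some Machines, filter the name list before the loop.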