Graceful node shutdown in GKE on Azure
Starting with version 1.26, GKE on Azure automatically enables Graceful Node Shutdown. This feature manages the graceful termination of Pods during node shutdowns. Graceful termination lets Pods save their state and release resources before the node is shut down. This method of terminating Pods minimizes the risk of data loss. It also minimizes the risk of interruptions to other Pods and services that rely on or interact with the Pods being shut down, which improves the resilience of your clusters.
How it works
An event such as scheduled maintenance, node scaling, or a hardware issue triggers a node shutdown. The kubelet component detects the event and initiates the graceful node termination process by instructing systemd to delay the system shutdown for a specified duration. This delay gives the node time to drain and evict the Pods running on it.
The goal of graceful node termination is to gracefully terminate both non-system and critical system Pods before the node shuts down. The following default settings are used:
- ShutdownGracePeriod: 30 seconds
- ShutdownGracePeriodCriticalPods: 15 seconds
These settings give non-system Pods 15 seconds to gracefully terminate before they are forcibly stopped. Critical system Pods have 15 seconds to shut down after the non-system Pods have terminated. However, since the feature operates on a best-effort basis, there's a possibility that a node might not be able to shut down gracefully within the designated 30-second period.
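The arithmetic behind these windows can be sketched as follows. This is a minimal illustration, assuming that the critical-Pod window is carved out of the total grace period, as the defaults above imply. The names `ShutdownBudget` and `split_grace_period` are hypothetical helpers for this example, not part of any GKE or Kubernetes API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShutdownBudget:
    """Time windows, in seconds, implied by the two shutdown settings."""
    non_critical_seconds: int
    critical_seconds: int


def split_grace_period(shutdown_grace_period: int,
                       shutdown_grace_period_critical_pods: int) -> ShutdownBudget:
    """Split the total grace period into non-system and critical-Pod windows.

    Hypothetical helper: assumes the critical-Pod window is reserved out of
    the total, so non-system Pods get whatever remains.
    """
    if shutdown_grace_period < 0 or shutdown_grace_period_critical_pods < 0:
        raise ValueError("grace periods must not be negative")
    if shutdown_grace_period_critical_pods > shutdown_grace_period:
        raise ValueError("critical-Pod window cannot exceed the total grace period")
    return ShutdownBudget(
        non_critical_seconds=shutdown_grace_period - shutdown_grace_period_critical_pods,
        critical_seconds=shutdown_grace_period_critical_pods,
    )


# Defaults described above: 30 seconds total, 15 of them reserved for critical Pods.
budget = split_grace_period(30, 15)
print(budget.non_critical_seconds, budget.critical_seconds)
```

With the default values, both windows come out to 15 seconds, which matches the description above.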
Triggers and limitations
Events that trigger graceful node shutdown include planned events such as the following:
- User-commanded shutdowns
- Termination of instances
- Scheduled maintenance
- Scaling down a cluster
In these scenarios, the kubelet detects the node shutdown event and initiates the graceful node shutdown process.
In contrast, graceful node shutdown can't be activated when the shutdown command doesn't trigger the systemd inhibitor lock mechanism that the kubelet component relies on. Examples of those kinds of situations include the following:
- Network disconnections
- Hardware malfunctions
- Insufficient resources such as memory or CPU
- Unexpected power outages
In these cases, the node might shut down abruptly, potentially causing disruptions or data loss.