Graceful node shutdown in Anthos clusters on Azure
Starting with version 1.26, Anthos clusters on Azure automatically enables graceful node shutdown. This feature manages the graceful termination of Pods during node shutdowns. Graceful termination lets Pods save their state and release resources before the node shuts down, which minimizes the risk of data loss. It also minimizes the risk of interruptions to other Pods and services that rely on or interact with the Pods being shut down, which makes your clusters more resilient.
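To benefit from graceful termination, an application has to react to the termination signal it receives. Kubernetes sends SIGTERM to a container's main process when its Pod is terminated. The following is a minimal sketch of a workload that saves its state on SIGTERM. The counter, the state.txt path, and the function names are hypothetical and only illustrate the pattern.

```python
import signal
import threading

# Set once the process has been asked to stop.
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the stop request; the main loop performs the cleanup."""
    stop_requested.set()


def save_state(state, path):
    """Hypothetical persistence step: write the counter to a file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(str(state))


def install_handler():
    """Register the SIGTERM handler (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def work_until_stopped(path="state.txt", tick_seconds=0.1):
    """Do placeholder work until a stop is requested, then save state."""
    counter = 0
    while not stop_requested.is_set():
        counter += 1  # stand-in for real work
        stop_requested.wait(tick_seconds)
    save_state(counter, path)
    return counter
```

A container entry point would call install_handler() and then work_until_stopped(). The cleanup step should be short enough to finish within the time that the Pod is given to terminate.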
How it works
An event such as scheduled maintenance, node scaling, or a hardware issue triggers a node shutdown. The kubelet component detects the event and initiates the graceful node shutdown process by instructing systemd to delay the system shutdown for a specified duration. This delay gives the node time to drain and evict the Pods running on it.
Graceful node shutdown aims to terminate both non-system Pods and critical system Pods cleanly before the node shuts down. It uses the following default settings:
ShutdownGracePeriod: 30 seconds
ShutdownGracePeriodCriticalPods: 15 seconds
ShutdownGracePeriod is the total delay, and ShutdownGracePeriodCriticalPods is the portion of that delay reserved for critical system Pods. With the defaults, non-system Pods get the first 15 seconds (30 minus 15) to terminate gracefully before they are forcibly stopped, and critical system Pods get the remaining 15 seconds to shut down after the non-system Pods have terminated. Because the feature operates on a best-effort basis, a node might not shut down gracefully within the 30-second period.
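The way the total period is divided between the two groups of Pods can be sketched as follows. This is an illustration of the arithmetic only, not kubelet code; the ShutdownBudget class and split_grace_period function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShutdownBudget:
    regular_pods_seconds: int   # window for non-system Pods
    critical_pods_seconds: int  # window for critical system Pods


def split_grace_period(shutdown_grace_period, critical_pods_grace_period):
    """Split the total shutdown delay into the two termination windows.

    The critical-Pod window is carved out of the total, so it cannot be
    negative or longer than the total.
    """
    if critical_pods_grace_period < 0:
        raise ValueError("critical Pod period cannot be negative")
    if critical_pods_grace_period > shutdown_grace_period:
        raise ValueError("critical Pod period cannot exceed the total period")
    return ShutdownBudget(
        regular_pods_seconds=shutdown_grace_period - critical_pods_grace_period,
        critical_pods_seconds=critical_pods_grace_period,
    )


# Default settings from this page: 30 seconds total, 15 for critical Pods.
budget = split_grace_period(30, 15)
print(budget.regular_pods_seconds, budget.critical_pods_seconds)
```

With the default values, both windows are 15 seconds long.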
Triggers and limitations
Events that trigger graceful node shutdown include planned events such as the following:
- User-commanded shutdowns
- Termination of instances
- Scheduled maintenance
- Scaling down a cluster
In these scenarios, the kubelet detects the node shutdown event and initiates the graceful node shutdown process.
In contrast, graceful node shutdown isn't activated when the shutdown doesn't trigger the systemd inhibitor lock mechanism that the kubelet component relies on. Examples of such situations include the following:
- Network disconnections
- Hardware malfunctions
- Insufficient resources such as memory or CPU
- Unexpected power outages
In these cases, the node may shut down abruptly, potentially causing disruptions or data loss.