Configure the eviction policies for VMs in GKE on Bare Metal

This document is for platform administrators who run virtual machines (VMs) in GKE on Bare Metal. This document shows you how to configure the eviction policy that controls how VMs that use VM Runtime on Google Distributed Cloud can automatically migrate to other hosts during maintenance events. These configuration settings let your VMs continue to run on other hosts when you upgrade a cluster or perform maintenance on a host.

Before you begin

To complete this document, you need access to a GKE on Bare Metal cluster at version 1.12.0 (anthosBareMetalVersion: 1.12.0) or higher. You can use any cluster type capable of running workloads. If needed, try GKE on Bare Metal on Compute Engine or see the cluster creation overview.

Eviction policy settings

GKE on Bare Metal lets you configure node behavior when you perform cluster upgrades or place nodes into maintenance mode. VM Runtime on Google Distributed Cloud uses an eviction policy that controls how the cluster handles VMs that run during these events.

When the eviction policy is enabled, VM Runtime on Google Distributed Cloud uses live migration by default to move VM workloads to another host. This behavior minimizes potential disruption to the VM workloads. With live migration, migratable VMs move from one host to another without disruption to the guest OS.

The following eviction policy settings can be configured:

Property                               Default       Valid values
evictionStrategy                       LiveMigrate   LiveMigrate, Restart
maxMigrationAttemptsPerVM              3             Integers in the range [1,5]
migrationTargetInitializationTimeout   30s           Any valid duration string, such as 1m, 2h, or 1h30m
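As a rough illustration of the valid values listed above, the following Python sketch checks a candidate evictionPolicy before you apply it. The function name validate_eviction_policy is hypothetical and isn't part of any GKE on Bare Metal tooling. The sketch also assumes that duration strings use only the h, m, and s units shown in the examples; the cluster might accept other units.

```python
import re

# Defaults and valid values copied from the settings table.
DEFAULTS = {
    "evictionStrategy": "LiveMigrate",
    "maxMigrationAttemptsPerVM": 3,
    "migrationTargetInitializationTimeout": "30s",
}
VALID_STRATEGIES = {"LiveMigrate", "Restart"}

# Assumption: durations are sequences of <number><unit> with units h, m, or s,
# such as 1m, 2h, or 1h30m. Other units are rejected by this sketch.
_DURATION_RE = re.compile(r"(?:\d+(?:\.\d+)?[hms])+")


def validate_eviction_policy(policy):
    """Return a list of problems found in policy; an empty list means it looks valid."""
    problems = [f"unknown property: {key}" for key in policy if key not in DEFAULTS]

    # Unset properties take their default values.
    merged = {**DEFAULTS, **policy}

    if merged["evictionStrategy"] not in VALID_STRATEGIES:
        problems.append("evictionStrategy must be LiveMigrate or Restart")

    attempts = merged["maxMigrationAttemptsPerVM"]
    is_int = isinstance(attempts, int) and not isinstance(attempts, bool)
    if not is_int or not 1 <= attempts <= 5:
        problems.append("maxMigrationAttemptsPerVM must be an integer in [1,5]")

    timeout = merged["migrationTargetInitializationTimeout"]
    if not isinstance(timeout, str) or not _DURATION_RE.fullmatch(timeout):
        problems.append("migrationTargetInitializationTimeout must be a duration string")

    return problems
```

For example, validate_eviction_policy({"maxMigrationAttemptsPerVM": 6}) reports one problem because 6 is outside the range [1,5], and an empty policy reports none because every property falls back to its default.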

evictionStrategy

If this value is set to LiveMigrate, VM Runtime on Google Distributed Cloud evicts each migratable VM from a node by attempting to migrate it. The VM is placed on another node in the cluster and continues to run.

If this value is set to Restart, VM Runtime on Google Distributed Cloud makes no attempt to migrate VMs. Instead, VMs are stopped when the node is put into maintenance mode, and the VMs start again when the node is brought out of maintenance mode. The VM is unavailable during the host maintenance event.

When you choose the LiveMigrate strategy, VM Runtime on Google Distributed Cloud falls back to the Restart strategy for non-migratable VMs.
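The strategy selection described in this section can be summarized in a short Python sketch. The function eviction_action and its parameters are hypothetical names used only for illustration; the sketch mirrors the documented behavior and doesn't show how VM Runtime is implemented.

```python
def eviction_action(eviction_strategy, vm_is_migratable):
    """Return the action applied to one VM when its node is evicted.

    Only a migratable VM under the LiveMigrate strategy is migrated.
    Every other combination results in a restart.
    """
    if eviction_strategy not in ("LiveMigrate", "Restart"):
        raise ValueError(f"unsupported evictionStrategy: {eviction_strategy}")

    if eviction_strategy == "LiveMigrate" and vm_is_migratable:
        return "LiveMigrate"

    # Restart strategy, or the fallback for a non-migratable VM.
    return "Restart"
```

For instance, eviction_action("LiveMigrate", False) returns "Restart", which corresponds to the fallback for non-migratable VMs.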

maxMigrationAttemptsPerVM

This setting controls the maximum number of migration attempts that can occur for a particular VM before the Restart strategy applies.

This setting is ignored if evictionStrategy is set to Restart.

migrationTargetInitializationTimeout

This setting controls the maximum amount of time a migration can take to start before it's considered to have failed. A migration that takes too long to start counts toward the failed migration budget for a VM workload, which maxMigrationAttemptsPerVM sets.

This setting is ignored if evictionStrategy is set to Restart.
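To show how maxMigrationAttemptsPerVM and migrationTargetInitializationTimeout interact, here is a minimal Python sketch of the retry budget. It's a simplified model, not the actual controller logic. The function evict_vm and the attempt_migration callback are hypothetical, and the sketch assumes that a migration starting exactly at the timeout still counts as started in time.

```python
def evict_vm(attempt_migration, max_attempts=3, timeout_seconds=30.0):
    """Model the migration retry budget for a single migratable VM.

    attempt_migration is a hypothetical callback that returns the number of
    seconds the migration took to start, or None if the attempt failed.
    Returns a tuple of (final action, number of failed attempts).
    """
    failed_attempts = 0
    while failed_attempts < max_attempts:
        startup_seconds = attempt_migration()
        if startup_seconds is not None and startup_seconds <= timeout_seconds:
            return "LiveMigrate", failed_attempts
        # A failed attempt or a start that exceeds the timeout uses up budget.
        failed_attempts += 1

    # Budget exhausted: the Restart strategy applies.
    return "Restart", failed_attempts
```

With the default values, a VM whose first two attempts fail (one by taking 45 seconds to start, one outright) and whose third attempt starts in 5 seconds is still live migrated. A VM whose attempts always take 60 seconds to start falls back to Restart after the budget is spent.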

Enable and configure the eviction policy

To use the eviction policy in VM Runtime on Google Distributed Cloud, the evictionPolicy setting must be defined in your cluster's VMRuntime object. This setting is enabled by default in GKE on Bare Metal version 1.12 and higher. If you upgrade from a previous version, you must enable this setting manually.

An empty object, such as evictionPolicy: {}, enables the eviction policy with default settings. To change the default behavior, add configuration options to the evictionPolicy object.
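The following Python sketch models how the settings resolve from the spec section of a VMRuntime object, represented here as a plain dictionary. The function effective_eviction_policy is a hypothetical helper for reasoning about a manifest, not an API of VM Runtime. It treats a missing or null evictionPolicy as not enabled, and fills unset properties with the documented defaults.

```python
# Documented defaults for the three eviction policy properties.
DEFAULT_EVICTION_POLICY = {
    "evictionStrategy": "LiveMigrate",
    "maxMigrationAttemptsPerVM": 3,
    "migrationTargetInitializationTimeout": "30s",
}


def effective_eviction_policy(vmruntime_spec):
    """Return the resolved eviction policy, or None if the policy isn't enabled."""
    policy = vmruntime_spec.get("evictionPolicy")
    if policy is None:
        # evictionPolicy is missing or null: no eviction policy applies.
        return None
    # An empty object yields the defaults; explicit values override them.
    return {**DEFAULT_EVICTION_POLICY, **policy}
```

For example, a spec that contains evictionPolicy: {} resolves to the three default values, while a spec that sets only evictionStrategy: Restart keeps the default values for the other two properties.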

To enable and configure the eviction policy, complete the following steps:

  1. Edit the VMRuntime custom resource:

    kubectl edit vmruntime vmruntime
    
  2. Add or update the evictionPolicy section with an empty object to use the default eviction policy settings, or add your desired settings:

    apiVersion: vm.cluster.gke.io/v1
    kind: VMRuntime
    metadata:
      name: vmruntime
    spec:
      disableCDIUploadProxyVIP: false
      enabled: true
      evictionPolicy: {}
      useEmulation: false
    

    If evictionPolicy isn't populated, such as evictionPolicy: null, or isn't included in the manifest, no special action is taken and VMs are forcefully evicted.

    The following example manifests show how to enable live migration of VMs, or restart the VMs on the same host after maintenance is complete:

    Live migrate

    The following example eviction policy enables live migration of VMs. The cluster can attempt the live migration up to 3 times, and each attempt has 30 seconds to start before it times out:

    apiVersion: vm.cluster.gke.io/v1
    kind: VMRuntime
    metadata:
      name: vmruntime
    spec:
      disableCDIUploadProxyVIP: false
      enabled: true
      evictionPolicy:
        evictionStrategy: LiveMigrate
        maxMigrationAttemptsPerVM: 3
        migrationTargetInitializationTimeout: 30s
      useEmulation: false
    

    Restart

    The following example eviction policy makes no attempt at migration for any VMs during eviction. VMs are stopped on their current host, and start again when the host finishes its maintenance event:

    apiVersion: vm.cluster.gke.io/v1
    kind: VMRuntime
    metadata:
      name: vmruntime
    spec:
      disableCDIUploadProxyVIP: false
      enabled: true
      evictionPolicy:
        evictionStrategy: Restart
      useEmulation: false
    
  3. Save and close the VMRuntime custom resource in your editor.

    VM Runtime on Google Distributed Cloud now applies the eviction policy to VMs in your cluster.

What's next