About repairing VMs for high availability

This document describes how a managed instance group (MIG) provides high availability of your application by repairing failed and unhealthy VMs in the group.

A MIG keeps your application up and available by proactively maintaining the number of running VMs in the group. If a VM in the group goes down, the MIG brings it back into service by recreating it. A repair can be triggered in the following ways:

Automatically repair a failed VM: If a VM fails or is deleted by an action not initiated by the MIG, then the MIG automatically repairs the failed VM. In this document, see Automatically repair a failed VM.
Repair a VM based on an application health check: An optional way to further improve high availability by repairing unhealthy VMs. If you configure an application-based health check and your application fails the health check, then the MIG marks that VM as unhealthy and repairs it. Repairing a VM based on an application health check is also called autohealing. In this document, see Repair a VM based on an application health check.

Automatically repair a failed VM

If a VM in a MIG fails, the MIG automatically repairs the failed VM by recreating it. A VM can fail for the following reasons:

Unexpected reasons like a hardware failure.
Actions not initiated by the MIG, such as the following:
- Preemption of a Spot VM.
- Infrastructure maintenance events when the VM instance is not set to live migrate.
- Actions performed directly on a VM using the VM instances console page, instances gcloud CLI commands, or instances API resource. For example, stopping a VM in the group using the instances.stop method or the gcloud compute instances stop command triggers repairing.

If the MIG itself intentionally stops or removes a VM (for example, when an autoscaler deletes a VM), then the MIG doesn't repair that VM.
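
The rules in this section can be summarized as a small decision function. The following is a minimal conceptual sketch, not Compute Engine code: the `Cause` categories and the `mig_repairs` function are hypothetical names introduced only to restate the text above.

```python
from enum import Enum


class Cause(Enum):
    """Illustrative reasons a VM in a MIG stops running (hypothetical labels)."""
    HARDWARE_FAILURE = "hardware_failure"
    SPOT_PREEMPTION = "spot_preemption"
    MAINTENANCE_WITHOUT_LIVE_MIGRATION = "maintenance_without_live_migration"
    DIRECT_USER_ACTION = "direct_user_action"  # for example, instances.stop on the VM
    MIG_INITIATED = "mig_initiated"            # for example, an autoscaler deletes a VM


def mig_repairs(cause: Cause) -> bool:
    """Return True if, per the rules described above, the MIG recreates the VM."""
    # The only case described as not repaired is an action the MIG itself initiated.
    return cause is not Cause.MIG_INITIATED
```

For example, `mig_repairs(Cause.DIRECT_USER_ACTION)` is `True`, which matches the statement that stopping a VM with `gcloud compute instances stop` triggers a repair.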

Repair a VM based on an application health check

In addition to the automatic repair of failed VMs, you might want to repair a VM if your application running on the VM freezes, crashes, or runs out of memory. To ensure that the application is responding as expected, you can configure an application-based health check.

An application-based health check periodically verifies that your application on each VM in a MIG is responding as expected. If the application on a VM doesn't respond, then the MIG marks that VM as unhealthy. The MIG then repairs the unhealthy VM. Repairing a VM based on an application health check is called autohealing.

To ensure that the MIG keeps running a subset of its VMs, the group never concurrently autoheals all of its VMs. This is useful if, for example, an incorrect health check triggers unnecessary repairs, a misconfigured firewall rule prevents a health check from probing the VM, or network connectivity or infrastructure issues cause a healthy VM to be misidentified as unhealthy. However, if a zonal MIG has only one VM, or a regional MIG has only one VM per zone, the MIG autoheals these VMs when they become unhealthy.
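
The safeguard above can be illustrated with a toy model. This document doesn't state how many VMs the MIG holds back, so the sketch below assumes the simplest possible rule: when every VM in a zone is unhealthy, leave one untouched. The function name `pick_vms_to_autoheal` and this hold-back rule are assumptions for illustration only, not the actual MIG algorithm.

```python
from typing import List, Set


def pick_vms_to_autoheal(all_vms: List[str], unhealthy: Set[str]) -> List[str]:
    """Choose which unhealthy VMs to recreate now (toy model for one zone)."""
    candidates = [vm for vm in all_vms if vm in unhealthy]
    if len(all_vms) == 1:
        # Exception from the text: a lone VM is autohealed when it is unhealthy.
        return candidates
    if len(candidates) == len(all_vms):
        # Assumed rule: never recreate every VM at once, so hold one back.
        return candidates[:-1]
    return candidates
```

In this model, a three-VM group in which all VMs fail the health check has only two VMs recreated at once, so a bad health check or firewall rule can't take the whole group down at the same time.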

Autohealing policy

Each MIG has an autohealing policy in which you can configure a health check and also set an initial delay. The initial delay is the time that a new VM takes to initialize and run its startup script. The initial delay timer starts when the MIG changes the VM's currentAction field to VERIFYING. During a VM's initial delay period, the MIG ignores unsuccessful health checks because the VM might be in the startup process. This prevents the MIG from prematurely recreating a VM. If the health check receives a healthy response during the initial delay, it indicates that the startup process is complete and the VM is ready.
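
The effect of the initial delay on a single probe result can be sketched as follows. This is a conceptual model only: `VmState`, `evaluate_probe`, and the returned labels are hypothetical names, and the model looks at just the latest probe rather than the full health check thresholds.

```python
from dataclasses import dataclass


@dataclass
class VmState:
    seconds_since_verifying: float  # time since currentAction changed to VERIFYING
    last_probe_healthy: bool        # result of the most recent health check probe


def evaluate_probe(vm: VmState, initial_delay_sec: float) -> str:
    """Classify the latest probe result for one VM (illustrative labels)."""
    if vm.last_probe_healthy:
        # A healthy response, even during the delay, means startup is complete.
        return "HEALTHY"
    if vm.seconds_since_verifying < initial_delay_sec:
        # Failures during the initial delay are ignored: the VM may still be booting.
        return "IGNORED"
    return "REPAIR"
```

With an initial delay of 300 seconds, a failed probe at 30 seconds is ignored, while a failed probe at 301 seconds makes the VM a candidate for repair.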

For more information about configuring an autohealing policy, see Set up an application health check and autohealing.

Monitor application health state changes

If you've configured an application-based health check in your MIG, you can check the health state of each VM in the MIG. For more information, see Check whether VMs are healthy.

You can also monitor the changes in the health state of a VM. For more information, see Monitor health state changes.

Pricing

When you set up an application-based health check, by default Compute Engine writes a log entry whenever a managed instance's health state changes. Cloud Logging provides a free allotment per month after which logging is priced by data volume. To avoid costs, you can disable the health state change logs.

Behavior during a repair

The following sections explain how a MIG behaves during automatic repairs and during repairs based on an application health check.

Update on repair

By default, during a repair, a MIG recreates a VM using the original instance template that was used to create the VM. For example, if a VM was created using instance-template-a and then you update the MIG to use instance-template-b in OPPORTUNISTIC mode, the MIG still uses instance-template-a to recreate the VM.

If you want your MIG to use the latest instance template and per-instance configurations during VM repair, you can configure the group to apply configuration updates during repairs.
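
The template choice described above comes down to a single condition. The following sketch restates it in code. The function `template_for_repair` and the boolean `update_on_repair` are hypothetical names standing in for the group's "apply configuration updates during repairs" setting.

```python
def template_for_repair(vm_template: str, mig_template: str, update_on_repair: bool) -> str:
    """Return the instance template used to recreate a VM during a repair."""
    # Default: reuse the template the VM was originally created from.
    # With updates on repair enabled: use the MIG's latest template instead.
    return mig_template if update_on_repair else vm_template
```

Using the example from this section, `template_for_repair("instance-template-a", "instance-template-b", False)` returns `"instance-template-a"`, and the same call with `True` returns `"instance-template-b"`.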

Disk handling

During a repair, when recreating a VM based on its template, the MIG handles different types of disks differently. Some disk configurations can cause a repair to fail when attempting to recreate a VM.

Disk type	`autodelete`	Behavior during a repair
New persistent disk	`true`	Disk is recreated as specified in the instance template. Any data that was written to that disk is lost when the disk and its VM are recreated.
New persistent disk	`false`	Disk is preserved and reattached when the MIG recreates the VM.
Existing persistent disk	`true`	Old disk is deleted. The VM recreate operation fails because Compute Engine cannot reattach a deleted disk to the VM. Also, for existing read/write disks, a MIG can have at most one VM because a single persistent disk cannot be attached to multiple VMs in read/write mode.
Existing persistent disk	`false`	Old disk is reattached as specified in the instance template. The data on the disk is preserved. However, for existing read/write disks, a MIG can have at most one VM because a single persistent disk cannot be attached to multiple VMs in read/write mode.
New local SSD	N/A	Disk is recreated as specified in the instance template. The data on a local SSD is lost when a VM is recreated or deleted.

The MIG does not reattach disks that are not specified in the instance template or per-instance configurations, such as disks that you attached to a VM manually after the VM was created.
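
The table above can also be read as a lookup from disk configuration to repair outcome. The sketch below encodes it that way. The keys, outcome labels, and function names are hypothetical and exist only to make the table easy to query, for example in a pre-deployment check of an instance template.

```python
from typing import Optional

# (disk kind, autodelete) -> outcome during a repair, transcribed from the table.
_REPAIR_OUTCOMES = {
    ("new_persistent", True): "recreated_data_lost",
    ("new_persistent", False): "preserved_and_reattached",
    ("existing_persistent", True): "repair_fails_disk_deleted",
    ("existing_persistent", False): "reattached_data_preserved",
    ("new_local_ssd", None): "recreated_data_lost",  # autodelete is N/A for local SSD
}


def disk_behavior(disk_kind: str, autodelete: Optional[bool]) -> str:
    """Look up what happens to a disk when the MIG recreates its VM."""
    return _REPAIR_OUTCOMES[(disk_kind, autodelete)]


def repair_can_fail(disk_kind: str, autodelete: Optional[bool]) -> bool:
    """Return True for the configuration that makes the recreate operation fail."""
    return disk_behavior(disk_kind, autodelete) == "repair_fails_disk_deleted"
```

In this model, only an existing persistent disk with `autodelete` set to `true` causes the repair to fail, which makes that combination worth flagging before you roll out a template.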

To preserve important data that was written to disk, take precautions, such as the following:

Take regular persistent disk snapshots.
Export data to another source, such as Cloud Storage.
Configure stateful persistent disks.

If your VMs have important settings that you want to preserve, Google also recommends that you use a custom image in your instance template. A custom image contains the custom settings that you need, so when you specify it in your instance template, the MIG recreates VMs with those settings in place.

Turn off repairs

You can turn off the repairs that a MIG performs automatically. Turning off repairs disables both the repair of failed VMs and repairs based on an application health check.

You might want to turn off repairs in a MIG in scenarios such as the following:

To investigate or debug a failed VM without interruption from automatic repair.
To repair VMs manually or implement your own repair logic.
To prevent registering new VMs while a batch job is in progress.
To observe application health states without repairing an unhealthy VM.
To fine-tune the health check configuration without triggering unnecessary repairs.

When you turn off repairs, the MIG doesn't take any action if a VM in the group fails or becomes unhealthy. Failed and unhealthy VMs continue to be in the group and the target number of running VMs in the MIG (targetSize) remains the same.
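
The difference between the two modes can be sketched as a single reconciliation pass over the group. This is a conceptual model, not Compute Engine behavior in detail: `reconcile` and the state labels are hypothetical names chosen for illustration.

```python
from typing import Dict

# Illustrative states that would normally trigger a repair.
NEEDS_REPAIR = {"FAILED", "UNHEALTHY"}


def reconcile(vm_states: Dict[str, str], repairs_enabled: bool) -> Dict[str, str]:
    """Return the VM states after one pass; the number of VMs never changes."""
    result = {}
    for name, state in vm_states.items():
        if repairs_enabled and state in NEEDS_REPAIR:
            result[name] = "RECREATING"  # the MIG recreates the VM
        else:
            result[name] = state         # repairs off: the VM stays as it is
    return result
```

With repairs turned off, a group such as `{"vm-1": "RUNNING", "vm-2": "FAILED"}` comes back unchanged: the failed VM stays in the group and still counts toward `targetSize`.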

If the MIG's update type is set to proactive and a new instance template is available, then the MIG tries to update the failed and unhealthy VMs.

If you've configured an application-based health check, turning off repairs doesn't affect the functioning of the health check. The health check continues to probe the application and provide the VM health states. This lets you monitor application health states while preventing the MIG from repairing unhealthy VMs.

If the MIG is part of a backend service of a load balancer and you turn off repairs in the MIG, any unrepaired failed and unhealthy VMs don't respond to the load balancer health check. If the number of these failed or unhealthy VMs in the MIG increases, the load balancer might reduce traffic to that MIG or switch to another backend, if one is configured. When the failed VMs become available again, the load balancer resumes sending traffic to the MIG.

For more information, see Turn off repairs in a MIG.