This document describes how a managed instance group (MIG) provides high availability of your application by repairing failed and unhealthy VMs in the group.
A MIG keeps your application up and available by proactively maintaining the number of running VMs in the group. If a VM in the group goes down, the MIG repairs the VM by recreating it in the following ways to bring the VM back to service:
- Automatically repair a failed VM: If a VM fails or is deleted by an action not initiated by the MIG, then the MIG automatically repairs the failed VM. In this document, see Automatically repair a failed VM.
- Repair a VM based on an application health check: An optional way to further improve high availability by repairing unhealthy VMs. If you configure an application-based health check and your application fails the health check, then the MIG marks that VM as unhealthy and repairs it. Repairing a VM based on an application health check is also called autohealing. In this document, see Repair a VM based on an application health check.
Automatically repair a failed VM
If a VM in a MIG fails, the MIG automatically repairs the failed VM by recreating it. A VM can fail due to the following reasons:
- Unexpected reasons like a hardware failure.
- Actions not initiated by the MIG, such as the following:
  - Preemption of a Spot VM.
  - Infrastructure maintenance events when the VM instance is not set to live migrate.
  - Actions performed directly on a VM using the VM instances console page, the instances gcloud CLI commands, or the instances API resource. For example, stopping a VM in the group using the instances.stop method or the gcloud compute instances stop command triggers a repair.
If the MIG intentionally stops a VM—for example, when an autoscaler deletes a VM—then the MIG doesn't repair that VM.
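The rule above can be sketched as a small decision function. This is a simplified model for illustration, not Compute Engine's implementation; the StopCause enum and the should_repair function are hypothetical names that don't exist in any Google Cloud library.

```python
from enum import Enum, auto


class StopCause(Enum):
    """Hypothetical categories for why a VM in a MIG stopped running."""
    HARDWARE_FAILURE = auto()        # unexpected failure
    SPOT_PREEMPTION = auto()         # Spot VM was preempted
    MAINTENANCE_NO_MIGRATE = auto()  # maintenance event, VM not set to live migrate
    DIRECT_ACTION_ON_VM = auto()     # e.g. instances.stop called on the VM directly
    MIG_INITIATED = auto()           # e.g. an autoscaler deleted the VM


def should_repair(cause: StopCause) -> bool:
    """Return True if, in this simplified model, the MIG recreates the VM.

    Only actions that the MIG itself initiated are excluded from repair.
    """
    return cause is not StopCause.MIG_INITIATED
```

For example, should_repair(StopCause.DIRECT_ACTION_ON_VM) returns True, while should_repair(StopCause.MIG_INITIATED) returns False.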
Repair a VM based on an application health check
In addition to the automatic repair of failed VMs, you might want to repair a VM if your application running on the VM freezes, crashes, or runs out of memory. To ensure that the application is responding as expected, you can configure an application-based health check.
An application-based health check periodically verifies that your application on each VM in a MIG is responding as expected. If the application on a VM doesn't respond, then the MIG marks that VM as unhealthy. The MIG then repairs the unhealthy VM. Repairing a VM based on an application health check is called autohealing.
To ensure that the MIG keeps running a subset of its VMs, the group never concurrently autoheals all of its VMs. This is useful if, for example, an incorrect health check triggers unnecessary repairs, a misconfigured firewall rule prevents a health check from probing the VM, or network connectivity or infrastructure issues cause a healthy VM to be misidentified as unhealthy. However, if a zonal MIG has only one VM, or a regional MIG has only one VM per zone, the MIG autoheals these VMs when they become unhealthy.
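The following sketch illustrates the idea for a zonal MIG. The exact number of VMs that a MIG autoheals at the same time isn't stated in this document, so the cap used here (leave at least one VM untouched) is an assumption, and pick_vms_to_autoheal is a hypothetical function name.

```python
def pick_vms_to_autoheal(health_by_vm: dict[str, bool]) -> list[str]:
    """Choose which unhealthy VMs to recreate in this round.

    health_by_vm maps a VM name to True (healthy) or False (unhealthy).
    """
    unhealthy = sorted(name for name, healthy in health_by_vm.items() if not healthy)

    # A group with a single VM is the documented exception: the VM is autohealed.
    if len(health_by_vm) <= 1:
        return unhealthy

    # ASSUMPTION: never touch every VM at once, so cap at group size minus one.
    max_concurrent = len(health_by_vm) - 1
    return unhealthy[:max_concurrent]
```

With this cap, a group in which every VM is reported unhealthy (for example, because of a misconfigured firewall rule) still keeps one VM running while the others are recreated.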
Autohealing policy
Each MIG has an autohealing policy in which you can configure a health check and
also set an initial delay. The initial delay is the time that a new VM takes to
initialize and run its startup script. The initial delay timer starts when the
MIG changes the VM's currentAction field to VERIFYING. During a VM's initial delay period, the
MIG ignores unsuccessful health checks because the VM might be in the startup
process. This prevents the MIG from prematurely recreating a VM. If the health
check receives a healthy response during the initial delay, it indicates that
the startup process is complete and the VM is ready.
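The initial delay behavior can be sketched as follows. This is a simplified model: VmState and evaluate_probe are hypothetical names, a single failed probe stands in for whatever unhealthy threshold the health check uses, and treating a failure that follows a healthy response inside the delay window as actionable is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class VmState:
    verifying_since: float  # time (seconds) when currentAction became VERIFYING
    ready: bool = False     # True once a healthy response has been received


def evaluate_probe(vm: VmState, now: float, initial_delay_sec: float,
                   probe_healthy: bool) -> str:
    """Return "HEALTHY", "IGNORED", or "REPAIR" for one health check result."""
    if probe_healthy:
        # A healthy response, even during the initial delay, means startup is done.
        vm.ready = True
        return "HEALTHY"

    in_initial_delay = (now - vm.verifying_since) < initial_delay_sec
    if in_initial_delay and not vm.ready:
        # The VM might still be starting up, so the failure is not acted on.
        return "IGNORED"

    return "REPAIR"
```

For example, with an initial delay of 300 seconds, a failed probe 10 seconds after the VM enters VERIFYING is ignored, while a failed probe at 400 seconds leads to a repair.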
For more information about configuring an autohealing policy, see Set up an application health check and autohealing.
Monitor application health state changes
If you've configured an application-based health check in your MIG, you can check the health state of each VM in the MIG. For more information, see Check whether VMs are healthy.
You can also monitor the changes in the health state of a VM. For more information, see Monitor health state changes.
Pricing
When you set up an application-based health check, by default Compute Engine writes a log entry whenever a managed instance's health state changes. Cloud Logging provides a free allotment per month after which logging is priced by data volume. To avoid costs, you can disable the health state change logs.
Behavior during a repair
The following sections explain how a MIG behaves during automatic repairs and during repairs based on an application health check.
Update on repair
By default, during a repair, a MIG recreates a VM using the original instance template that was used to create the VM. For example, if a VM was created using instance-template-a and then you update the MIG to use instance-template-b in OPPORTUNISTIC mode, the MIG still uses instance-template-a to recreate the VM.
If you want your MIG to use the latest instance template and per-instance configurations during VM repair, you can configure the group to apply configuration updates during repairs.
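As a minimal sketch of the template choice during a repair, assuming a boolean setting that stands in for the "apply configuration updates during repairs" option (the function and parameter names are hypothetical):

```python
def template_for_repair(original_template: str, latest_template: str,
                        update_on_repair: bool = False) -> str:
    """Return the instance template used to recreate a VM during a repair.

    By default the VM is recreated from the template it was created with.
    When update_on_repair is True, the group's latest template is used.
    """
    return latest_template if update_on_repair else original_template
```

With the default setting, template_for_repair("instance-template-a", "instance-template-b") returns "instance-template-a", which matches the OPPORTUNISTIC example above.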
Disk handling
During a repair, when recreating a VM based on its template, the MIG handles different types of disks differently. Some disk configurations can cause a repair to fail when attempting to recreate a VM.
Disk type | autodelete | Behavior during a repair |
---|---|---|
New persistent disk | true | Disk is recreated as specified in the instance template. Any data that was written to that disk is lost when the disk and its VM are recreated. |
New persistent disk | false | Disk is preserved and reattached when MIG recreates the VM. |
Existing persistent disk | true | Old disk is deleted. VM recreate operation fails because Compute Engine cannot reattach a deleted disk to the VM. However, for existing read/write disks, a MIG can have only up to one VM because a single persistent disk cannot be attached to multiple VMs in read/write mode. |
Existing persistent disk | false | Old disk is reattached as specified in the instance template. The data on the disk is preserved. However, for existing read/write disks, a MIG can have only up to one VM because a single persistent disk cannot be attached to multiple VMs in read/write mode. |
New local SSD | N/A | Disk is recreated as specified in the instance template. The data on a local SSD is lost when a VM is recreated or deleted. |
The MIG does not reattach disks that are not specified in the instance template or per-instance configurations, such as disks that you attached to a VM manually after the VM was created.
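The disk handling table can be encoded as a lookup, which might be useful for checking a template's disk configuration before relying on repairs. This is an illustrative sketch only: the disk kind strings, RepairOutcome, and repair_outcome are hypothetical names, and the values simply restate the table.

```python
from typing import NamedTuple, Optional


class RepairOutcome(NamedTuple):
    repair_succeeds: bool  # whether recreating the VM is expected to succeed
    data_preserved: bool   # whether data written to the disk survives the repair


# (disk kind, autodelete) -> outcome, transcribed from the table above.
_OUTCOMES = {
    ("new-persistent", True): RepairOutcome(True, False),
    ("new-persistent", False): RepairOutcome(True, True),
    ("existing-persistent", True): RepairOutcome(False, False),
    ("existing-persistent", False): RepairOutcome(True, True),
    ("new-local-ssd", None): RepairOutcome(True, False),
}


def repair_outcome(disk_kind: str, autodelete: Optional[bool] = None) -> RepairOutcome:
    """Look up the documented repair behavior for a disk configuration."""
    # autodelete doesn't apply to local SSDs (N/A in the table).
    key = (disk_kind, None if disk_kind == "new-local-ssd" else autodelete)
    try:
        return _OUTCOMES[key]
    except KeyError:
        raise ValueError(
            f"no documented behavior for {disk_kind!r} with autodelete={autodelete!r}"
        ) from None
```

For example, repair_outcome("existing-persistent", True).repair_succeeds is False, which flags the configuration that causes a repair to fail.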
To preserve important data that was written to disk, take precautions, such as the following:
- Take regular persistent disk snapshots.
- Export data to another source, such as Cloud Storage.
- Configure stateful persistent disks.
If your VMs have important settings that you want to preserve, Google also recommends that you use a custom image in your instance template. A custom image contains any custom settings that you need. When you specify a custom image in your instance template, the MIG recreates VMs using the custom image that contains the custom settings you need.
Turn off repairs
You can turn off repairs that are automatically done by a MIG. When you turn off repairs, repairing of failed VMs and repairing based on an application health check are turned off.
You might want to turn off repairs in a MIG in scenarios such as the following:
- To investigate or debug a failed VM without interruption from automatic repair.
- To repair VMs manually or implement your own repair logic.
- To prevent registering new VMs while a batch job is in progress.
- To observe application health states without repairing an unhealthy VM.
- To fine-tune health check configuration without false-triggering repairs.
When you turn off repairs, the MIG doesn't take any action if a VM in the group fails or becomes unhealthy. Failed and unhealthy VMs continue to be in the group and the target number of running VMs in the MIG (targetSize) remains the same. If the MIG's update type is set to proactive and a new instance template is available, then the MIG tries to update the failed and unhealthy VMs.
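The behavior described above can be summarized in a short sketch. The function name, parameters, and returned strings are hypothetical and model only what this section states; they are not values from the Compute Engine API.

```python
def mig_action(vm_ok: bool, repairs_enabled: bool, update_type: str,
               new_template_available: bool) -> str:
    """Return what the MIG does for one VM: "NONE", "REPAIR", or "UPDATE"."""
    if vm_ok:
        return "NONE"    # nothing to do for a running, healthy VM
    if repairs_enabled:
        return "REPAIR"  # failed or unhealthy VM is recreated
    # Repairs are turned off: the VM stays in the group and targetSize is unchanged.
    if update_type == "PROACTIVE" and new_template_available:
        return "UPDATE"  # the MIG still tries to update the VM
    return "NONE"
```

For example, mig_action(False, False, "OPPORTUNISTIC", True) returns "NONE", so a failed VM stays in the group untouched, while the same call with "PROACTIVE" returns "UPDATE".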
If you've configured an application-based health check, turning off repairs doesn't affect the functioning of the health check. The health check continues to probe the application and provide the VM health states. This lets you monitor application health states while preventing the MIG from repairing unhealthy VMs.
If the MIG is part of a backend service of a load balancer and you turn off repairs in the MIG, any unrepaired failed and unhealthy VMs don't respond to the load balancer health check. If the number of these failed or unhealthy VMs in the MIG increases, the load balancer might reduce traffic to that MIG or switch to another backend, if configured. When the failed VMs become available again, the load balancer resumes the traffic to the MIG.
For more information, see Turn off repairs in a MIG.
What's next
- Set up an application-based health check and autohealing.
- Check whether repairs are turned off in a MIG.
- Apply configuration updates during repairs.