This document describes how a managed instance group (MIG) provides high availability of your application by repairing failed and unhealthy VMs in the group.
A MIG keeps your application up and available by proactively maintaining the group's target size. If a VM in the group goes down, the MIG brings it back into service by recreating it, in the following ways:
- Automatically repair a failed VM: If a VM fails or is deleted by an action not initiated by the MIG, then the MIG automatically repairs the failed VM.
- Autoheal a VM based on an application health check: Autohealing is an optional way to further improve high availability by repairing unhealthy VMs. If you configure an application-based health check and your application fails the health check, then the MIG repairs that VM.
Automatically repair a failed VM
If a VM in a MIG fails, the MIG automatically repairs the failed VM by recreating it. A VM can fail due to the following reasons:
- Unexpected reasons like a hardware failure.
- Actions not initiated by the MIG, such as a user or process outside of the group stopping or deleting the VM.
If the MIG intentionally stops a VM—for example, when an autoscaler deletes a VM—then the MIG doesn't repair that VM.
To ensure that the MIG doesn't revert your configuration changes, make changes through the MIG's own methods: the instance groups console page, the instance-groups managed gcloud CLI commands, and the zonal or regional instance group manager API resources.
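For example, with the gcloud CLI you can apply common configuration changes through the group itself rather than on individual VMs (the group name, zone, size, and template name below are placeholder assumptions):

```shell
# Point the MIG at a new instance template; the template is applied to
# new and repaired VMs according to the group's update policy
gcloud compute instance-groups managed set-instance-template example-mig \
    --template=example-template-b \
    --zone=us-central1-a

# Change the group's target size through the MIG, not by deleting VMs directly
gcloud compute instance-groups managed resize example-mig \
    --size=5 \
    --zone=us-central1-a
```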
Autoheal a VM based on an application health check
In addition to the automatic repair of failed VMs, you might want to repair a VM if your application running on the VM freezes, crashes, or runs out of memory. To ensure that the application is responding as expected, you can configure an application-based health check.
An application-based health check periodically verifies that your application on each VM in a MIG is responding as expected. If the application on a VM doesn't respond, then the MIG marks that VM as unhealthy. The MIG then autoheals the unhealthy VM.
Each MIG has an autohealing policy in which you can configure a health check and
also set an initial delay. The initial delay is the time that a new VM takes to
initialize and run its startup script. The initial delay timer starts when the
MIG sets the VM's current action to VERIFYING. During a VM's initial delay period, the
MIG ignores unsuccessful health checks because the VM might be in the startup
process. This prevents the MIG from prematurely recreating a VM. If the health
check receives a healthy response during the initial delay, it indicates that
the startup process is complete and the VM is ready.
To ensure that the MIG keeps running a subset of its VMs, the group never concurrently autoheals all of its instances. This is useful if, for example, the autohealing policy does not fit the workload, firewall rules are misconfigured, or there are network connectivity or infrastructure issues that misidentify a healthy VM as unhealthy. However, if a zonal MIG has only one VM, or a regional MIG has only one VM per zone, a MIG recreates these VMs when they become unhealthy.
For more information about configuring an autohealing policy, see Set up an application health check and autohealing.
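As a sketch of such a configuration with the gcloud CLI (the resource names, port, request path, and timings are placeholder assumptions):

```shell
# Create an HTTP health check that periodically probes the application
gcloud compute health-checks create http example-health-check \
    --port=80 \
    --request-path=/healthz \
    --check-interval=30s \
    --timeout=10s \
    --unhealthy-threshold=3 \
    --healthy-threshold=1

# Attach the health check to the MIG with a 5-minute initial delay, so
# that VMs aren't repaired while their startup scripts are still running
gcloud compute instance-groups managed update example-mig \
    --zone=us-central1-a \
    --health-check=example-health-check \
    --initial-delay=300
```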
Monitor health state changes
If you've configured an application-based health check for your MIG, you can check the health state of each VM in the MIG. For more information, see Checking the status.
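For example, assuming a zonal MIG named example-mig that has an autohealing health check configured, you can list per-instance health with the gcloud CLI:

```shell
# Lists each managed instance; when an application-based health check is
# configured, the output includes a health state column (for example,
# HEALTHY, UNHEALTHY, or a verifying state for VMs in their initial delay)
gcloud compute instance-groups managed list-instances example-mig \
    --zone=us-central1-a
```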
When you set up an application-based health check, by default Compute Engine writes a log entry whenever a managed instance's health state changes. Cloud Logging provides a free allotment per month after which logging is priced by data volume. To avoid costs, you can disable the health state change logs.
Behavior during a repair
The following sections explain the general behavior during a repair that applies to both automatic repair and autohealing.
Update on repair
By default, during a repair, a MIG recreates a VM using the original instance template that was used to create the VM. For example, if a VM was created from instance-template-a and you later update the MIG to use instance-template-b in opportunistic update mode, the MIG still uses instance-template-a to recreate the VM.
When recreating a VM based on its template, the MIG handles different types of disks differently. Some disk configurations can cause repair to fail when attempting to recreate a VM.
| Disk configuration | Auto-delete | Behavior during a repair |
|---|---|---|
| New persistent disk | On | Disk is recreated as specified in the instance template. Any data that was written to that disk is lost when the disk and its VM are recreated. |
| New persistent disk | Off | Disk is preserved and reattached when the MIG recreates the VM. |
| Existing persistent disk | On | Old disk is deleted. The VM recreate operation fails because Compute Engine cannot reattach a deleted disk to the VM. However, for existing read/write disks, a MIG can have at most one VM because a single persistent disk cannot be attached to multiple VMs in read/write mode. |
| Existing persistent disk | Off | Old disk is reattached as specified in the instance template. The data on the disk is preserved. However, for existing read/write disks, a MIG can have at most one VM because a single persistent disk cannot be attached to multiple VMs in read/write mode. |
| New local SSD | N/A | Disk is recreated as specified in the instance template. The data on a local SSD is lost when a VM is recreated or deleted. |
The MIG does not reattach disks that are not specified in the instance template or per-instance configurations, such as disks that you attached to a VM manually after the VM was created.
To preserve important data that was written to disk, take precautions, such as the following:
- Take regular persistent disk snapshots.
- Export data to another source, such as Cloud Storage.
- Configure stateful persistent disks.
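The first and third precautions can be sketched with the gcloud CLI as follows (the disk, snapshot, and group names are placeholder assumptions):

```shell
# Take a one-off snapshot of a persistent disk to back up its data
gcloud compute disks snapshot example-data-disk \
    --zone=us-central1-a \
    --snapshot-names=example-data-disk-backup

# Mark an attached disk as stateful so the MIG preserves and reattaches
# it, rather than recreating it, when the VM is repaired
gcloud compute instance-groups managed update example-mig \
    --zone=us-central1-a \
    --stateful-disk=device-name=example-data-disk,auto-delete=never
```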
If your VMs have important settings that you want to preserve, Google also recommends that you use a custom image in your instance template. When you specify a custom image, the MIG recreates VMs from an image that already contains the custom settings you need.
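A sketch of that workflow with the gcloud CLI (the image, source disk, template name, and machine type are placeholder assumptions):

```shell
# Create a custom image from a disk that already has your settings applied
gcloud compute images create example-custom-image \
    --source-disk=example-configured-disk \
    --source-disk-zone=us-central1-a

# Reference the custom image in a new instance template, so that repaired
# VMs are recreated with those settings baked in
gcloud compute instance-templates create example-template \
    --image=example-custom-image \
    --machine-type=e2-medium
```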