This document describes how to set up a health check for an application running on each VM in a managed instance group (MIG), enable autohealing to repair unhealthy instances, and check the current health state of each VM.
You can configure an application-based health check to verify that your application responds as expected. If the health check determines that your application isn't responding, the MIG repairs that VM. Repairing a VM based on an application health check is called autohealing.
To learn how a MIG automatically repairs VMs, see About repairing VMs in a MIG.
Pricing
When you set up an application-based health check, by default Compute Engine writes a log entry whenever a managed instance's health state changes. Cloud Logging provides a free allotment per month after which logging is priced by data volume. To avoid costs, you can disable the health state change logs.
Set up a health check and an autohealing policy
You can apply a single health check to a maximum of 50 MIGs. If you have more than 50 groups, create multiple health checks. A MIG can have only one autohealing policy, which configures a single health check.
The following example shows how to use a health check on a MIG. In this example, you create a health check that looks for a web server response on port 80. To enable the health check probes to reach each web server, you configure a firewall rule. Finally, you apply the health check to the MIG by setting the group's autohealing policy.
Console
Create a health check for autohealing that is more conservative than a load balancing health check.
For example, create a health check that looks for a response on port 80 and that can tolerate some failure before it marks VMs as UNHEALTHY and causes them to be recreated. In this example, a VM is marked as healthy if it returns successfully once. It is marked as unhealthy if it returns unsuccessfully 3 consecutive times.
1. In the Google Cloud console, go to the Create a health check page.
2. Give the health check a name, such as example-check.
3. For Protocol, make sure that HTTP is selected.
4. For Port, enter 80.
5. For Check interval, enter 5.
6. For Timeout, enter 5.
7. Set a Healthy threshold to determine how many consecutive successful health checks must be returned before an unhealthy VM is marked as healthy. Enter 1 for this example.
8. Set an Unhealthy threshold to determine how many consecutive unsuccessful health checks must be returned before a healthy VM is marked as unhealthy. Enter 3 for this example.
9. Click Create to create the health check.
Create a firewall rule to allow health check probes to connect to your app.
Health check probes come from addresses in the ranges 130.211.0.0/22 and 35.191.0.0/16, so make sure that your network firewall rules allow the health check to connect. For this example, our MIG uses the default network and its VMs are listening on port 80. If port 80 is not already open on the default network, create a firewall rule.
1. In the Google Cloud console, go to the Create a firewall rule page.
2. For Name, enter a name for the firewall rule. For example, allow-health-check.
3. For Network, select the default network.
4. For Source filter, select IP ranges.
5. For Source IP ranges, enter 130.211.0.0/22 and 35.191.0.0/16.
6. In Protocols and ports, select Specified protocols and ports and enter tcp:80.
7. Click Create.
Apply the health check by configuring an autohealing policy for your regional or zonal MIG.
1. In the Google Cloud console, go to the Instance groups page.
2. Under the Name column of the list, click the name of the MIG where you want to apply the health check.
3. Click Edit to modify this MIG.
4. In the VM instance lifecycle section, under Autohealing, select the health check that you created previously.
5. Change or keep the Initial delay setting. The initial delay is the number of seconds that a new VM takes to initialize and run its startup script. During a VM's initial delay period, the MIG ignores unsuccessful health checks because the VM might be in the startup process. This prevents the MIG from prematurely recreating a VM. If the health check receives a healthy response during the initial delay, it indicates that the startup process is complete and the VM is ready. The initial delay timer starts when the VM's currentAction field changes to VERIFYING. The value of initial delay must be between 0 and 3600 seconds. In the console, the default value is 300.
6. Click Save to apply your changes.
gcloud
To use the command-line examples in this guide, install the Google Cloud CLI or use Cloud Shell.
Create a health check for autohealing that is more conservative than a load balancing health check.
For example, create a health check that looks for a response on port 80 and that can tolerate some failure before it marks VMs as UNHEALTHY and causes them to be recreated. In this example, a VM is marked as healthy if it returns successfully once. It is marked as unhealthy if it returns unsuccessfully 3 consecutive times.
gcloud compute health-checks create http example-check --port 80 \
    --check-interval 30s \
    --healthy-threshold 1 \
    --timeout 10s \
    --unhealthy-threshold 3
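As a quick sanity check (not a required step), you can describe the health check and project just the probe settings; the field names in the --format projection match the HealthCheck resource fields shown in the API example later in this section.
gcloud compute health-checks describe example-check \
    --format="yaml(checkIntervalSec,timeoutSec,healthyThreshold,unhealthyThreshold)"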
Create a firewall rule to allow health check probes to connect to your app.
Health check probes come from addresses in the ranges 130.211.0.0/22 and 35.191.0.0/16, so make sure that your firewall rules allow the health check to connect. For this example, our MIG uses the default network, and its VMs listen on port 80. If port 80 isn't already open on the default network, create a firewall rule.
gcloud compute firewall-rules create allow-health-check \
    --allow tcp:80 \
    --source-ranges 130.211.0.0/22,35.191.0.0/16 \
    --network default
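Optionally, you can confirm the new rule before relying on it; the sourceRanges and allowed fields are the same Firewall resource fields shown in the API example later in this section.
gcloud compute firewall-rules describe allow-health-check \
    --format="yaml(sourceRanges,allowed)"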
Apply the health check by configuring an autohealing policy for your regional or zonal MIG.
Use the update command to apply the health check to the MIG.
The initial-delay setting is the number of seconds that a new VM takes to initialize and run its startup script. During a VM's initial delay period, the MIG ignores unsuccessful health checks because the VM might be in the startup process. This prevents the MIG from prematurely recreating a VM. If the health check receives a healthy response during the initial delay, it indicates that the startup process is complete and the VM is ready. The initial delay timer starts when the VM's currentAction field changes to VERIFYING. The value of initial delay must be between 0 and 3600 seconds. The default value is 0.
For example:
gcloud compute instance-groups managed update my-mig \
    --health-check example-check \
    --initial-delay 300 \
    --zone us-east1-b
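To confirm that the autohealing policy is now attached to the group, you can read back its autoHealingPolicies field; the group name and zone below match the example above.
gcloud compute instance-groups managed describe my-mig \
    --zone us-east1-b \
    --format="yaml(autoHealingPolicies)"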
API
To use the API examples in this guide, set up API access.
Create a health check for autohealing that is more conservative than a load balancing health check.
For example, create a health check that looks for a response on port 80 and that can tolerate some failure before it marks VMs as UNHEALTHY and causes them to be recreated. In this example, a VM is marked as healthy if it returns successfully once. It is marked as unhealthy if it returns unsuccessfully 3 consecutive times.
POST https://compute.googleapis.com/compute/v1/projects/project-id/global/healthChecks

{
  "name": "example-check",
  "type": "HTTP",
  "port": 80,
  "checkIntervalSec": 30,
  "healthyThreshold": 1,
  "timeoutSec": 10,
  "unhealthyThreshold": 3
}
Create a firewall rule to allow health check probes to connect to your app.
Health check probes come from addresses in the ranges 130.211.0.0/22 and 35.191.0.0/16, so make sure that your firewall rules allow the health check to connect. For this example, our MIG uses the default network and its VMs are listening on port 80. If port 80 is not already open on the default network, create a firewall rule.
POST https://compute.googleapis.com/compute/v1/projects/[PROJECT_ID]/global/firewalls

{
  "name": "allow-health-check",
  "network": "https://www.googleapis.com/compute/v1/projects/[PROJECT_ID]/global/networks/default",
  "sourceRanges": [
    "130.211.0.0/22",
    "35.191.0.0/16"
  ],
  "allowed": [
    {
      "ports": [
        "80"
      ],
      "IPProtocol": "tcp"
    }
  ]
}
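If you are sending these requests from a terminal rather than from the API Explorer, a minimal curl sketch looks like the following; it assumes the JSON body above is saved as firewall.json and that the gcloud CLI is already authenticated.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @firewall.json \
    "https://compute.googleapis.com/compute/v1/projects/[PROJECT_ID]/global/firewalls"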
Apply the health check by configuring an autohealing policy for your regional or zonal MIG.
An autohealing policy is part of an instanceGroupManager resource or a regionInstanceGroupManager resource. You can set an autohealing policy using the insert or patch methods.
The following example sets an autohealing policy by using the instanceGroupManagers.patch method.
PATCH https://compute.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/instanceGroupManagers/[INSTANCE_GROUP]

{
  "autoHealingPolicies": [
    {
      "healthCheck": "global/healthChecks/example-check",
      "initialDelaySec": 300
    }
  ]
}
The initialDelaySec setting is the number of seconds that a new VM takes to initialize and run its startup script. During a VM's initial delay period, the MIG ignores unsuccessful health checks because the VM might be in the startup process. This prevents the MIG from prematurely recreating a VM. If the health check receives a healthy response during the initial delay, it indicates that the startup process is complete and the VM is ready. The initial delay timer starts when the VM's currentAction field changes to VERIFYING. The value of initial delay must be between 0 and 3600 seconds. The default value is 0.
To turn off application-based autohealing, set the autohealing policy to an empty value, autoHealingPolicies[]. With autoHealingPolicies[], the MIG recreates only VMs that are not in a RUNNING state.
You can get the autohealing policy of a MIG by reading the instanceGroupManagers.autoHealingPolicies field. You can get a MIG resource using one of the following methods (an example request follows this list):
- instanceGroupManagers.get returns the specified zonal managed instance group resource.
- instanceGroupManagers.list returns all zonal MIGs in a specified project and zone.
- regionInstanceGroupManagers.get returns the specified regional MIG resource.
- regionInstanceGroupManagers.list returns all regional MIGs in a specified project and region.
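For example, the following request to the instanceGroupManagers.get method returns the zonal MIG resource, including its autoHealingPolicies field; replace the placeholders with your own project, zone, and group name.
GET https://compute.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/instanceGroupManagers/[INSTANCE_GROUP]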
After the group creation or health check configuration update completes, it can take 30 minutes before autohealing begins monitoring instances in the group. After monitoring begins, Compute Engine marks instances as healthy or recreates them, based on your autohealing configuration. For example, if you configure an initial delay of 5 minutes, a health check interval of 1 minute, and a healthy threshold of 1 check, the timeline looks like the following:
- 30 minute delay before autohealing begins monitoring instances in the group
- + 5 minutes for the configured initial delay
- + 1 minute for the check interval * healthy threshold (60s * 1)
- = 36 minutes before the instance is either marked as healthy or is recreated
Checking the status
You can verify that a VM is created and its application is responding by inspecting the current health state of each VM, by checking the current action on each VM, or by checking the group's status.
Checking whether VMs are healthy
If you have configured an application-based health check for your MIG, you can review the health state of each managed instance.
Inspect your managed instance health states to:
- Identify unhealthy VMs that are not being autohealed. A VM might not be repaired immediately, even if it has been diagnosed as unhealthy, in the following situations:
  - The VM is still booting, and its initial delay has not passed.
  - A significant share of unhealthy instances is currently being autohealed. The autohealer delays further autohealing to ensure that the group keeps running a subset of instances.
- Detect health check configuration errors. For example, you can detect misconfigured firewall rules or an invalid application health checking endpoint if the instance reports a health state of TIMEOUT.
- Determine the initial delay value to configure by measuring the amount of time between when the VM transitions to a RUNNING status and when the VM transitions to a HEALTHY health state. You can measure this gap by polling the list-instances method or by observing the time between the instances.insert operation and the first healthy signal received (see the example after this list).
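One simple way to observe that gap is to poll the group while a new VM boots, for example with the standard watch utility; the group name and zone below are placeholders for your own values.
watch -n 10 gcloud compute instance-groups managed list-instances my-mig --zone us-east1-b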
Use the console, the gcloud command-line tool, or the API to view health states.
Console
In the Google Cloud console, go to the Instance groups page.
Under the Name column of the list, click the name of the MIG that you want to examine. A page opens with the instance group properties and a list of VMs that are included in the group.
If a VM is unhealthy, you can see its health state in the Health check status column.
gcloud
Use the list-instances sub-command.
gcloud compute instance-groups managed list-instances instance-group
NAME ZONE STATUS HEALTH_STATE ACTION INSTANCE_TEMPLATE VERSION_NAME LAST_ERROR
igm-with-hc-fvz6 europe-west1 RUNNING HEALTHY NONE my-template
igm-with-hc-gtz3 europe-west1 RUNNING HEALTHY NONE my-template
The HEALTH_STATE column shows each VM's health state.
API
For a regional MIG, construct a POST request to the listManagedInstances method:
POST https://compute.googleapis.com/compute/v1/projects/project-id/regions/region/instanceGroupManagers/instance-group/listManagedInstances
For a zonal MIG, use the zonal MIG listManagedInstances method:
POST https://compute.googleapis.com/compute/v1/projects/project-id/zones/zone/instanceGroupManagers/instance-group/listManagedInstances
The request returns a response similar to the following, which includes an instanceHealth field for each managed instance.
{
  "managedInstances": [
    {
      "instance": "https://www.googleapis.com/compute/v1/projects/project-id/zones/zone/instances/example-group-5485",
      "instanceStatus": "RUNNING",
      "currentAction": "NONE",
      "lastAttempt": {},
      "id": "6159431761228150698",
      "instanceTemplate": "https://www.googleapis.com/compute/v1/projects/project-id/global/instanceTemplates/example-template",
      "version": {
        "instanceTemplate": "https://www.googleapis.com/compute/v1/projects/project-id/global/instanceTemplates/example-template"
      },
      "instanceHealth": [
        {
          "healthCheck": "https://www.googleapis.com/compute/v1/projects/project-id/global/healthChecks/http-basic-check",
          "detailedHealthState": "HEALTHY"
        }
      ]
    },
    {
      "instance": "https://www.googleapis.com/compute/v1/projects/project-id/zones/zone/instances/example-group-sfdp",
      "instanceStatus": "STOPPING",
      "currentAction": "DELETING",
      "lastAttempt": {},
      "id": "6622324799312181783",
      "instanceHealth": [
        {
          "healthCheck": "https://www.googleapis.com/compute/v1/projects/project-id/global/healthChecks/http-basic-check",
          "detailedHealthState": "TIMEOUT"
        }
      ]
    }
  ]
}
Health states
The following VM health states are available:
- HEALTHY: The VM is reachable, a connection to the application health checking endpoint can be established, and the response conforms to the requirements defined by the health check.
- DRAINING: The VM is being drained. Existing connections to the VM have time to complete, but new connections are being refused.
- UNHEALTHY: The VM is reachable, but does not conform to the requirements defined by the health check.
- TIMEOUT: The VM is unreachable, a connection to the application health checking endpoint cannot be established, or the server on a VM does not respond within the specified timeout. For example, this may be caused by misconfigured firewall rules or an overloaded server application on a VM.
- UNKNOWN: The health checking system is not aware of the VM or its health is not known at the moment. It can take 30 minutes for monitoring to begin on new VMs in a MIG.
New VMs return an UNHEALTHY state until they are verified by the health checking system.
Whether a VM is repaired depends on its health state:
- If a VM has a health state of UNHEALTHY or TIMEOUT, and it has passed its initialization period, then the autohealing service immediately attempts to repair it.
- If a VM has a health state of UNKNOWN, then it will not be repaired immediately. This is to prevent an unnecessary repair of a VM for which the health checking signal is temporarily unavailable.
Autohealing attempts can be delayed if:
- A VM remains unhealthy after multiple consecutive repairs.
- A significant overall share of unhealthy VMs exists in the group.
We want to learn about your use cases, challenges, or feedback about VM health state values. Please share your feedback with our team at mig-discuss@google.com.
Viewing current actions on VMs
When a MIG is currently in the process of creating a VM instance, the MIG sets that instance's read-only currentAction field to CREATING. If an autohealing policy is attached to the group, once the VM is created and running, the MIG sets the instance's current action to VERIFYING and the health checker begins to probe the VM's application. If the application passes this initial health check within the time that it takes for the application to start, then the VM is verified and the MIG changes the VM's currentAction field to NONE.
Use the Google Cloud CLI or the Compute Engine API to see details about the instances in a managed instance group. Details include instance status and current actions that the group is performing on its instances.
gcloud
All managed instances
To check the status and current actions on all instances in the group, use the list-instances command.
gcloud compute instance-groups managed list-instances INSTANCE_GROUP_NAME \
    [--zone=ZONE | --region=REGION]
The command returns a list of instances in the group, including their status, current actions, and other details:
NAME               ZONE           STATUS    HEALTH_STATE  ACTION    INSTANCE_TEMPLATE  VERSION_NAME  LAST_ERROR
vm-instances-9pk4  us-central1-f                          CREATING  my-new-template
vm-instances-h2r1  us-central1-f  STOPPING                DELETING  my-old-template
vm-instances-j1h8  us-central1-f  RUNNING                 NONE      my-old-template
vm-instances-ngod  us-central1-f  RUNNING                 NONE      my-old-template
The HEALTH_STATE column appears empty unless you have set up health checking.
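If you only want to see instances that the group is actively working on, you can add the gcloud-wide --filter flag; the currentAction key below is taken from the API response fields shown later on this page and is an illustrative assumption about the filterable key name.
gcloud compute instance-groups managed list-instances INSTANCE_GROUP_NAME \
    --zone=ZONE \
    --filter="currentAction!=NONE"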
A specific managed instance
To check the status and current action for a specific instance in the group, use the describe-instance command.
gcloud compute instance-groups managed describe-instance INSTANCE_GROUP_NAME \
    --instance INSTANCE_NAME \
    [--zone=ZONE | --region=REGION]
The command returns details about the instance, including instance status, current action, and, for stateful MIGs, preserved state:
currentAction: NONE
id: '6789072894767812345'
instance: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-a/instances/example-mig-hz41
instanceStatus: RUNNING
name: example-mig-hz41
preservedStateFromConfig:
  metadata:
    example-key: example-value
preservedStateFromPolicy:
  disks:
    persistent-disk-0:
      autoDelete: NEVER
      mode: READ_WRITE
      source: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-a/disks/example-mig-hz41
version:
  instanceTemplate: https://www.googleapis.com/compute/v1/projects/example-project/global/instanceTemplates/example-template
API
Call the listManagedInstances method on a regional or zonal MIG resource. For example, to see details about the instances in a zonal MIG resource, you can make the following request:
GET https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instanceGroupManagers/INSTANCE_GROUP_NAME/listManagedInstances
The call returns a list of instances for the MIG, including each instance's instanceStatus and currentAction.
{ "managedInstances": [ { "instance": "https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-f/instances/vm-instances-prvp", "id": "5317605642920955957", "instanceStatus": "RUNNING", "instanceTemplate": "https://www.googleapis.com/compute/v1/projects/example-project/global/instanceTemplates/example-template", "currentAction": "REFRESHING" }, { "instance": "https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-f/instances/vm-instances-pz5j", "currentAction": "DELETING" }, { "instance": "https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-f/instances/vm-instances-w2t5", "id": "2800161036826218547", "instanceStatus": "RUNNING", "instanceTemplate": "https://www.googleapis.com/compute/v1/projects/example-project/global/instanceTemplates/example-template", "currentAction": "REFRESHING" } ] }
To see a list of valid instanceStatus field values, see VM instance lifecycle.
If an instance is undergoing some type of change, the managed instance group sets the instance's currentAction field to one of the following actions to help you track the progress of the change. Otherwise, the currentAction field is set to NONE.
Possible currentAction values are:
- ABANDONING. The instance is being removed from the MIG.
- CREATING. The instance is in the process of being created.
- CREATING_WITHOUT_RETRIES. The instance is being created without retries; if the instance isn't created on the first try, the MIG doesn't try to replace the instance again.
- DELETING. The instance is in the process of being deleted.
- RECREATING. The instance is being replaced.
- REFRESHING. The instance is being removed from its current target pools and being readded to the list of current target pools (this list might be the same or different from existing target pools).
- RESTARTING. The instance is in the process of being restarted using the stop and start methods.
- VERIFYING. The instance has been created and is in the process of being verified.
- NONE. No actions are being performed on the instance.
Checking whether the MIG is stable
At the group level, Compute Engine populates a read-only field called status that contains an isStable flag.
If all VMs in the group are running and healthy (that is, the currentAction field for each managed instance is set to NONE), then the MIG sets the status.isStable field to true. Remember that the stability of a MIG depends on group configurations beyond the autohealing policy; for example, if your group is autoscaled and it is currently scaling in or out, then the MIG sets the status.isStable field to false due to the autoscaler operation.
Verify that all instances in a managed instance group are running and healthy by checking the value of the group's status.isStable field.
gcloud
Use the describe command:
gcloud compute instance-groups managed describe instance-group-name \
    [--zone zone | --region region]
The gcloud CLI returns detailed information about the MIG, including its status.isStable field.
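If a script only needs the flag itself, you can narrow the output with a value() projection; this is a sketch, and the placeholders follow the command above.
gcloud compute instance-groups managed describe instance-group-name \
    --zone zone \
    --format="value(status.isStable)"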
To pause a script until the MIG is stable, use the wait-until command with the --stable flag. For example:
gcloud compute instance-groups managed wait-until instance-group-name \
--stable \
[--zone zone | --region region]
Waiting for group to become stable, current operations: deleting: 4
Waiting for group to become stable, current operations: deleting: 4
...
Group is stable
The command returns after status.isStable is set to true for the MIG.
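For example, a deployment script might resize the group and then block until the MIG reports a stable state before continuing; the group name, size, and zone here are illustrative.
gcloud compute instance-groups managed resize instance-group-name \
    --size 5 \
    --zone zone
gcloud compute instance-groups managed wait-until instance-group-name \
    --stable \
    --zone zone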
API
For a zonal MIG, make a GET request to the instanceGroupManagers.get method:
GET https://compute.googleapis.com/compute/v1/projects/project-id/zones/zone/instanceGroupManagers/instance-group-name
For a regional managed instance group, replace zones/zone with regions/region:
GET https://compute.googleapis.com/compute/v1/projects/project-id/regions/region/instanceGroupManagers/instance-group-name
The Compute Engine API returns detailed information about the MIG, including its status.isStable field.
status.isStable set to false indicates that changes are active, pending, or that the MIG itself is being modified.
status.isStable set to true indicates the following:
- None of the instances in the MIG are undergoing any type of change and the currentAction for all instances is NONE.
- No changes are pending for instances in the MIG.
- The MIG itself is not being modified.
Remember that the stability of a MIG depends on numerous factors because a MIG can be modified in numerous ways. For example:
- You make a request to roll out a new instance template.
- You make a request to create, delete, resize or update instances in the MIG.
- An autoscaler requests to resize the MIG.
- An autohealer resource is replacing one or more unhealthy instances in the MIG.
- In a regional MIG, some of the instances are being redistributed.
As soon as all actions are finished, status.isStable is set to true again for that MIG.
Viewing historical autohealing operations
You can use the gcloud CLI or the API to view past autohealing events.
gcloud
Use the gcloud compute operations list command with a filter to see only the autohealing repair events in your project.
gcloud compute operations list --filter='operationType~compute.instances.repair.*'
For more information about a specific repair operation, use the describe command. For example:
gcloud compute operations describe repair-1539070348818-577c6bd6cf650-9752b3f3-1d6945e5 --zone us-east1-b
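To make the list easier to scan, you can combine the same filter with a table projection; name, operationType, targetLink, status, and insertTime are standard fields on the Operation resource.
gcloud compute operations list \
    --filter='operationType~compute.instances.repair.*' \
    --format="table(name,operationType,targetLink,status,insertTime)"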
API
For regional MIGs, submit a GET
request to the
regionOperations
resource and include a filter to scope the output list to
compute.instances.repair.*
events.
GET https://compute.googleapis.com/compute/v1/projects/project-id/region/region/operations?filter=operationType+%3D+%22compute.instances.repair.*%22
For zonal MIGs, use the zoneOperations resource.
GET https://compute.googleapis.com/compute/v1/projects/project-id/zones/zone/operations?filter=operationType+%3D+%22compute.instances.repair.*%22
For more information about a specific repair operation, submit a GET request for that specific operation. For example:
GET https://compute.googleapis.com/compute/v1/projects/project-id/zones/zone/operations/repair-1539070348818-577c6bd6cf650-9752b3f3-1d6945e5
What makes a good autohealing health check
Health checks used for autohealing should be conservative so that they don't preemptively delete and recreate your instances. When an autohealer health check is too aggressive, the autohealer might mistake busy instances for failed instances and unnecessarily restart them, reducing availability. A sample health check that follows these guidelines appears after the list.
- unhealthy-threshold. Should be more than 1. Ideally, set this value to 3 or more. This protects against rare failures like network packet loss.
- healthy-threshold. A value of 2 is sufficient for most apps.
- timeout. Set this time value to a generous amount (five times or more than the expected response time). This protects against unexpected delays like busy instances or a slow network connection.
- check-interval. This value should be between 1 second and two times the timeout (neither too long nor too short). When a value is too long, a failed instance is not caught soon enough. When a value is too short, the instances and the network can become measurably busy, given the high number of health check probes being sent every second.
What's next
- Try the tutorial, Using autohealing for highly available apps.
- Monitor VM health state changes
- Apply configuration updates during repairs