Setting up health checking and autohealing

Managed instance groups (MIGs) maintain high availability of your applications by proactively keeping your virtual machine (VM) instances available, that is, in the RUNNING state. If a managed instance stops running, but the change of state was not initiated by the MIG, then the MIG automatically recreates that instance. On the other hand, if the MIG intentionally stops an instance from RUNNING—for example, when an autoscaler deletes an instance—then the MIG doesn't recreate that instance.

Changes of instance state that are not initiated by the MIG include, for example, a VM crashing, being preempted, or being stopped from inside the guest OS.

However, relying on an instance's state to determine application health might not be sufficient. For example, a check for whether an instance is RUNNING does not detect application failures, such as freezing, overloading, or crashing.

To further improve the availability of your application and to verify that your application is responding, you can configure an autohealing policy for your MIG.

An autohealing policy relies on an application-based health check to verify that an application is responding as expected. Checking that an application responds is more precise than simply verifying that a VM is in a RUNNING state.

If the autohealer determines that an application isn't responding, the MIG automatically recreates that VM. For a preemptible VM, the group recreates the VM when the necessary resources become available again.

Autohealing behavior

Autohealing recreates unhealthy VMs using the original instance template that was used to create the VM (not necessarily the current instance template in the MIG). For example, if a VM was created using instance-template-a and then you update the MIG to use instance-template-b in OPPORTUNISTIC mode, autohealing still uses instance-template-a to recreate the VM. This is because autohealing recreations are not user-initiated, so Compute Engine doesn't assume that the VM should use the new template. If you want to apply a new template, see Changing the instance template for a MIG.

At any given time, the number of concurrently autohealed VMs is smaller than the MIG's size. This ensures that the group keeps running a subset of VMs even if, for example, the autohealing policy does not fit the workload, firewall rules are misconfigured, or there are network connectivity or infrastructure issues that misidentify a healthy VM as unhealthy. However, if a zonal MIG has only one VM, or a regional MIG has only one VM per zone, autohealing recreates these VMs when they become unhealthy.

Autohealing doesn't recreate a VM during that VM's initialization period. For more information, see the autoHealingPolicies[].initialDelaySec property. This setting delays autohealing from checking on and potentially prematurely recreating the VM if the VM is in the process of starting up. The initial delay timer starts when the VM has a currentAction of VERIFYING.

When you first attach a health check to a managed instance group, it can take 30 minutes before monitoring begins.

Autohealing and disks

When recreating a VM based on its template, the autohealer handles different types of disks differently. Some disk configurations can cause the autohealer to fail when it attempts to recreate a VM.

Disk type | Autodelete | Behavior during an autohealing operation
New persistent disk | true | Disk is recreated as specified in the instance's template. Any data that was written to that disk is lost when the disk and its VM are recreated.
New persistent disk | false | Disk is preserved and reattached when the autohealer recreates the VM.
Existing persistent disk | true | Old disk is deleted. VM recreation fails because Compute Engine cannot reattach a deleted disk to the VM.
Existing persistent disk | false | Old disk is reattached as specified in the instance's template. The data on the disk is preserved. However, for existing read/write disks, a MIG can have only up to one VM because a single persistent disk cannot be attached to multiple VMs in read/write mode.
New local SSD | N/A | Disk is recreated as specified in the instance's template. The data on a local SSD is lost when a VM is recreated or deleted.

The autohealer does not reattach disks that are not specified in the instance's template, such as disks that you attached to a VM manually after the VM was created.

To preserve important data that was written to disk, take precautions such as setting the disk's auto-delete option to false or backing up the data to a separate storage location.

If your VMs have important settings that you want to preserve, Google also recommends that you use a custom image in your instance template. The MIG then recreates VMs from the custom image, with those settings already in place.

Setting up a health check and an autohealing policy

You can set a maximum of one autohealing policy per MIG.

You can apply a single health check to a maximum of 50 MIGs. If you have more than 50 groups, create multiple health checks.

Example health check setup

The following example shows how to use a health check on a MIG. In this example, you create a health check that looks for a web server response on port 80. To enable the health check probes to reach each web server, you configure a firewall rule. Finally, you apply the health check to the MIG by setting the group's autohealing policy.

Console

  1. Create a health check for autohealing that is more conservative than a load balancing health check.

    For example, create a health check that looks for a response on port 80 and that can tolerate some failure before it marks VMs as UNHEALTHY and causes them to be recreated. In this example, a VM is marked as healthy if it returns successfully once. It is marked as unhealthy if it returns unsuccessfully 3 consecutive times.

    1. In the Google Cloud Console, go to the Create a health check page.


    2. Give the health check a name, such as example-check.
    3. For Protocol, make sure that HTTP is selected.
    4. For Port, enter 80.
    5. For Check interval, enter 5.
    6. For Timeout, enter 5.
    7. Set a Healthy threshold to determine how many consecutive successful health checks must be returned before an unhealthy VM is marked as healthy. Enter 1 for this example.
    8. Set an Unhealthy threshold to determine how many consecutive unsuccessful health checks must be returned before a healthy VM is marked as unhealthy. Enter 3 for this example.
    9. Click Create to create the health check.
  2. Create a firewall rule to allow health check probes to connect to your app.

    Health check probes come from addresses in the ranges 130.211.0.0/22 and 35.191.0.0/16, so make sure your network firewall rules allow the health check to connect. For this example, our MIG uses the default network and its VMs are listening on port 80. If port 80 is not already open on the default network, create a firewall rule.

    1. In the Google Cloud Console, go to the Create a firewall rule page.


    2. For Name, enter a name for the firewall rule. For example, allow-health-check.
    3. For Network, select the default network.
    4. For Source filter, select IP ranges.
    5. For Source IP ranges, enter 130.211.0.0/22 and 35.191.0.0/16.
    6. In Protocols and ports, select Specified protocols and ports and enter tcp:80.
    7. Click Create.
  3. Apply the health check by configuring an autohealing policy for your regional or zonal MIG.

    1. In the Google Cloud Console, go to the Instance groups page.


    2. Under the Name column of the list, click the name of the MIG where you want to apply the health check.
    3. Click Edit group to modify this MIG.
    4. Under Autohealing, select the health check that you created previously.
    5. Change or keep the Initial delay setting. This setting delays autohealing from potentially prematurely recreating the VM if the VM is in the process of starting up. The initial delay timer starts when the currentAction of the VM is VERIFYING.
    6. Click Save to apply your changes.

gcloud

To use the command-line examples in this guide, install the gcloud command-line tool or use Cloud Shell.

  1. Create a health check for autohealing that is more conservative than a load balancing health check.

    For example, create a health check that looks for a response on port 80 and that can tolerate some failure before it marks VMs as UNHEALTHY and causes them to be recreated. In this example, a VM is marked as healthy if it returns successfully once. It is marked as unhealthy if it returns unsuccessfully 3 consecutive times.

    gcloud compute health-checks create http example-check --port 80 \
           --check-interval 30s \
           --healthy-threshold 1 \
           --timeout 10s \
           --unhealthy-threshold 3
  2. Create a firewall rule to allow health check probes to connect to your app.

    Health check probes come from addresses in the ranges 130.211.0.0/22 and 35.191.0.0/16, so make sure your firewall rules allow the health check to connect. For this example, our MIG uses the default network, and its VMs listen on port 80. If port 80 isn't already open on the default network, create a firewall rule.

    gcloud compute firewall-rules create allow-health-check \
            --allow tcp:80 \
            --source-ranges 130.211.0.0/22,35.191.0.0/16 \
            --network default
  3. Apply the health check by configuring an autohealing policy for your regional or zonal MIG.

    Use the update command to apply the health check to the MIG.

    The initial-delay setting delays autohealing from potentially prematurely recreating the VM if the VM is in the process of starting up. The initial delay timer starts when the currentAction of the VM is VERIFYING.

    For example:

    gcloud compute instance-groups managed update my-mig \
            --health-check example-check \
            --initial-delay 300 \
            --zone us-east1-b

API

To use the API examples in this guide, set up API access.

  1. Create a health check for autohealing that is more conservative than a load balancing health check.

    For example, create a health check that looks for a response on port 80 and that can tolerate some failure before it marks VMs as UNHEALTHY and causes them to be recreated. In this example, a VM is marked as healthy if it returns successfully once. It is marked as unhealthy if it returns unsuccessfully 3 consecutive times.

    POST https://compute.googleapis.com/compute/v1/projects/project-id/global/healthChecks
    
    {
     "name": "example-check",
     "type": "http",
     "port": 80,
     "checkIntervalSec": 30,
     "healthyThreshold": 1,
     "timeoutSec": 10,
     "unhealthyThreshold": 3
    }
    
  2. Create a firewall rule to allow health check probes to connect to your app.

    Health check probes come from addresses in the ranges 130.211.0.0/22 and 35.191.0.0/16, so make sure your firewall rules allow the health check to connect. For this example, our MIG uses the default network and its VMs are listening on port 80. If port 80 is not already open on the default network, create a firewall rule.

    POST https://compute.googleapis.com/compute/v1/projects/[PROJECT_ID]/global/firewalls
    
    {
     "name": "allow-health-check",
     "network": "https://www.googleapis.com/compute/v1/projects/[PROJECT_ID]/global/networks/default",
     "sourceRanges": [
      "130.211.0.0/22",
      "35.191.0.0/16"
     ],
     "allowed": [
      {
       "ports": [
        "80"
       ],
       "IPProtocol": "tcp"
      }
     ]
    }
    
  3. Apply the health check by configuring an autohealing policy for your regional or zonal MIG.

    An autohealing policy is part of an instanceGroupManager resource or regionInstanceGroupManager resource.

    You can set an autohealing policy using the insert or patch methods.

    The following example sets an autohealing policy by using the instanceGroupManagers.patch method.

    PATCH https://compute.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/instanceGroupManagers/[INSTANCE_GROUP]
    {
      "autoHealingPolicies": [
        {
          "healthCheck": "global/healthChecks/example-check",
          "initialDelaySec": 300
        }
      ]
    }
    

    The initialDelaySec setting delays autohealing from potentially prematurely recreating the VM if the VM is in the process of starting up. The initial delay timer starts when the currentAction of the VM is VERIFYING.

    To turn off application-based autohealing, set autoHealingPolicies to an empty list ([]). With an empty autohealing policy, the MIG recreates only VMs that are not in a RUNNING state.

    You can get the current autohealing policy of a MIG by reading the instanceGroupManagers.autoHealingPolicies field in the response from the get or list method on the MIG resource.
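    As noted above, you can turn off application-based autohealing by patching the MIG with an empty policy. For example (a sketch; the project, zone, and instance group names are placeholders):

    ```
    PATCH https://compute.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/instanceGroupManagers/[INSTANCE_GROUP]

    {
      "autoHealingPolicies": []
    }
    ```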

After the group creation or health check configuration update completes, it can take 30 minutes before autohealing begins monitoring instances in the group. Once monitoring starts, Compute Engine marks instances as healthy (or recreates them) based on your autohealing configuration. For example, if you configure an initial delay of 5 minutes, a health check interval of 1 minute, and a healthy threshold of 1 check, the timeline looks like the following:

  • 30 minute delay before autohealing begins monitoring instances in the group
  • + 5 minutes for the configured initial delay
  • + 1 minute for the check interval * healthy threshold (60s * 1)
  • = 36 minutes before the instance is either marked as healthy or is recreated
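The timeline arithmetic above can be checked directly. The following Python sketch is illustrative only; the 30-minute figure is the upper bound described in this section:

```python
# Worst-case time before a new instance is first marked healthy (or
# recreated), using the example configuration above. Values in seconds.
monitoring_startup_delay = 30 * 60  # up to 30 min before monitoring begins
initial_delay = 5 * 60              # configured initial delay
check_interval = 60                 # configured health check interval
healthy_threshold = 1               # consecutive successes required

time_to_first_verdict = (
    monitoring_startup_delay
    + initial_delay
    + check_interval * healthy_threshold
)
print(time_to_first_verdict // 60, "minutes")  # 36 minutes
```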

Checking the status

You can verify that a VM is created and its application is responding by inspecting the current health state of each VM, by checking the current action on each VM, or by checking the group's status.

Checking whether VMs are healthy

If you have configured autohealing for your MIG, you can review the health state of each managed instance.

Inspect your managed instance health states to:

  • Identify unhealthy VMs that are not being autohealed. A VM might not be repaired immediately even if it has been diagnosed as unhealthy in the following situations:
    • The VM is still booting, and its initial delay has not passed.
    • A significant share of unhealthy instances is currently being autohealed. The autohealer delays further autohealing to ensure that the group keeps running a subset of instances.
  • Detect health check configuration errors. For example, you can detect misconfigured firewall rules or an invalid application health checking endpoint if the instance reports a health state of TIMEOUT.
  • Determine the initial delay value to configure by measuring the amount of time between when the VM transitions to a RUNNING status and when the VM transitions to a HEALTHY health state. You can measure this gap by polling the list-instances method.
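You can sketch that measurement as a polling loop. In the following Python model, get_health_state is a hypothetical stand-in for a real call that lists the group's managed instances (for example, by shelling out to gcloud compute instance-groups managed list-instances) and reads the new VM's health state; it is stubbed here so the example is self-contained:

```python
# Hypothetical poll: replace with a real call that returns the VM's
# current health state (for example, by parsing list-instances output).
# This stub simulates a VM that reports HEALTHY ~120s after RUNNING.
def get_health_state(elapsed_s):
    return "HEALTHY" if elapsed_s >= 120 else "UNKNOWN"

def measure_startup_gap(poll, interval_s=10, max_wait_s=600):
    """Seconds between the VM entering RUNNING and first reporting HEALTHY."""
    elapsed = 0
    while elapsed <= max_wait_s:
        if poll(elapsed) == "HEALTHY":
            return elapsed
        # In a real loop you would time.sleep(interval_s) here.
        elapsed += interval_s
    raise TimeoutError("VM never reported HEALTHY")

gap = measure_startup_gap(get_health_state)
initial_delay_sec = gap + 60  # add headroom when setting initialDelaySec
```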

Use the console, the gcloud command-line tool, or the API to view health states.

Console

  1. In the Google Cloud Console, go to the Instance groups page.

    Go to the Instance groups page.

  2. Under the Name column of the list, click the name of the MIG that you want to examine. A page opens with the instance group properties and a list of VMs that are included in the group.

  3. If a VM is unhealthy, you can see its health state in the Health check status column.

gcloud

Use the list-instances sub-command.

gcloud compute instance-groups managed list-instances instance-group
NAME              ZONE                  STATUS   HEALTH_STATE  ACTION  INSTANCE_TEMPLATE                            VERSION_NAME  LAST_ERROR
igm-with-hc-fvz6  europe-west1          RUNNING  HEALTHY       NONE    my-template
igm-with-hc-gtz3  europe-west1          RUNNING  HEALTHY       NONE    my-template

The HEALTH_STATE column shows each VM's health state.

API

For a regional MIG, construct a POST request to the listManagedInstances method:

POST https://compute.googleapis.com/compute/v1/projects/project-id/regions/region/instanceGroupManagers/instance-group/listManagedInstances

For a zonal MIG, use the zonal MIG listManagedInstances method:

POST https://compute.googleapis.com/compute/v1/projects/project-id/zones/zone/instanceGroupManagers/instance-group/listManagedInstances

The request returns a response similar to the following, which includes an instanceHealth field for each managed instance.

{
 "managedInstances": [
  {
   "instance": "https://www.googleapis.com/compute/v1/projects/project-id/zones/zone/instances/example-group-5485",
   "instanceStatus": "RUNNING",
   "currentAction": "NONE",
   "lastAttempt": {
   },
   "id": "6159431761228150698",
   "instanceTemplate": "https://www.googleapis.com/compute/v1/projects/project-id/global/instanceTemplates/example-template",
   "version": {
    "instanceTemplate": "https://www.googleapis.com/compute/v1/projects/project-id/global/instanceTemplates/example-template"
   },
   "instanceHealth": [
    {
     "healthCheck": "https://www.googleapis.com/compute/v1/projects/project-id/global/healthChecks/http-basic-check",
     "detailedHealthState": "HEALTHY"
    }
   ]
  },
  {
   "instance": "https://www.googleapis.com/compute/v1/projects/project-id/zones/zone/instances/example-group-sfdp",
   "instanceStatus": "STOPPING",
   "currentAction": "DELETING",
   "lastAttempt": {
   },
   "id": "6622324799312181783",
   "instanceHealth": [
    {
     "healthCheck": "https://www.googleapis.com/compute/v1/projects/project-id/global/healthChecks/http-basic-check",
     "detailedHealthState": "TIMEOUT"
    }
   ]
  }
 ]
}

Health states

The following VM health states are available:

  • HEALTHY: The VM is reachable, a connection to the application health checking endpoint can be established, and the response conforms to the requirements defined by the health check.
  • DRAINING: The VM is being drained. Existing connections to the VM have time to complete, but new connections are being refused.
  • UNHEALTHY: The VM is reachable, but does not conform to the requirements defined by the health check.
  • TIMEOUT: The VM is unreachable: a connection to the application health checking endpoint cannot be established, or the server on a VM does not respond within the specified timeout. For example, this may be caused by misconfigured firewall rules or an overloaded server application on a VM.
  • UNKNOWN: The health checking system is not aware of the VM or its health is not known at the moment. It can take 30 minutes for monitoring to begin on new VMs in a MIG.

New VMs return an UNHEALTHY state until they are verified by the health checking system.

Whether a VM is repaired depends on its health state:

  • If a VM has a health state of UNHEALTHY or TIMEOUT, and it has passed its initialization period, then the autohealing service immediately attempts to repair it.
  • If a VM has a health state of UNKNOWN, then it will not be repaired immediately. This is to prevent an unnecessary repair of a VM for which the health-checking signal is temporarily unavailable.

Autohealing attempts can be delayed if:

  • A VM remains unhealthy after multiple consecutive repairs.
  • A significant overall share of unhealthy VMs exists in the group.
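These rules can be summarized as a small decision function. The following Python sketch is a simplified model for illustration, not the actual autohealer implementation:

```python
def repairs_immediately(health_state, past_initial_delay):
    """Whether the autohealer immediately attempts a repair (simplified)."""
    if not past_initial_delay:
        return False  # VM is still within its initialDelaySec window
    # UNHEALTHY and TIMEOUT trigger an immediate repair attempt;
    # UNKNOWN does not, to avoid repairing a VM whose health-checking
    # signal is only temporarily unavailable.
    return health_state in ("UNHEALTHY", "TIMEOUT")
```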

We want to learn about your use cases, challenges, or feedback about VM health state values. Please share your feedback with our team at mig-discuss@google.com.

Viewing current actions on VMs

When a VM is in the process of being created, that instance's currentAction is CREATING. If an autohealing policy is attached to the group, then after the VM is created and running, the instance proceeds to a currentAction of VERIFYING and the health checker begins to probe the VM's application. If the application passes this initial health check within the configured initial delay, the VM is verified and its currentAction changes to NONE.

You can see the currentAction being performed and the status of each instance in a managed instance group with the gcloud command-line tool or the Compute Engine API.

gcloud

gcloud compute instance-groups managed list-instances INSTANCE_GROUP_NAME \
    [--zone=ZONE | --region=REGION]

The command returns a list of instances in the MIG and their statuses and current actions. For example:

NAME               ZONE           STATUS   HEALTH_STATE  ACTION  INSTANCE_TEMPLATE  VERSION_NAME  LAST_ERROR
vm-instances-9pk4  us-central1-f                          CREATING  my-new-template
vm-instances-h2r1  us-central1-f  STOPPING                DELETING  my-old-template
vm-instances-j1h8  us-central1-f  RUNNING                 NONE      my-old-template
vm-instances-ngod  us-central1-f  RUNNING                 NONE      my-old-template

The HEALTH_STATE column appears empty unless you have set up health checking.

API

Call the listManagedInstances method on a regional or zonal MIG resource. For example:

GET https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instanceGroupManagers/INSTANCE_GROUP_NAME/listManagedInstances

The call returns a list of instances for the MIG including each instance's instanceStatus and currentAction.

{
 "managedInstances": [
  {
   "instance": "https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-f/instances/vm-instances-prvp",
   "id": "5317605642920955957",
   "instanceStatus": "RUNNING",
   "instanceTemplate": "https://www.googleapis.com/compute/v1/projects/example-project/global/instanceTemplates/example-template",
   "currentAction": "REFRESHING"
  },
  {
   "instance": "https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-f/instances/vm-instances-pz5j",
   "currentAction": "DELETING"
  },
  {
   "instance": "https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-f/instances/vm-instances-w2t5",
   "id": "2800161036826218547",
   "instanceStatus": "RUNNING",
   "instanceTemplate": "https://www.googleapis.com/compute/v1/projects/example-project/global/instanceTemplates/example-template",
   "currentAction": "REFRESHING"
  }
 ]
}

To see a list of valid instanceStatus field values, see Checking an instance's status.

If an instance is undergoing some type of change, its currentAction field is populated with one of the following actions to help you track the progress of the change. Otherwise, the currentAction field is NONE.

Possible currentAction values are:

  • ABANDONING. The instance is being removed from the MIG.
  • CREATING. The instance is in the process of being created.
  • CREATING_WITHOUT_RETRIES. The instance is being created without retries; if the instance isn't created on the first try, the MIG doesn't try to replace the instance again.
  • DELETING. The instance is in the process of being deleted.
  • RECREATING. The instance is being replaced.
  • REFRESHING. The instance is being removed from its current target pools and re-added to the group's current list of target pools (this list might be the same as or different from the existing target pools).
  • RESTARTING. The instance is in the process of being restarted using the stop and start methods.
  • VERIFYING. The instance has been created and is in the process of being verified.
  • NONE. No actions are being performed on the instance.

Checking whether the MIG is stable

At the group level, Compute Engine populates a read-only field called status that contains an isStable flag.

If all VMs in the group are running and healthy (that is, the currentAction of each managed instance is NONE), then status.isStable is set to true. Remember that the stability of a MIG depends on group configurations beyond the autohealing policy; for example, if your group is autoscaled and is currently scaling in or out, then isStable is false due to the autoscaler operation.

Verify that all instances in a managed instance group are running and healthy by checking the value of the group's status.isStable field.

gcloud

Use the describe command:

gcloud compute instance-groups managed describe instance-group-name \
    [--zone zone | --region region]

The gcloud tool returns detailed information about the MIG, including its status.isStable field.

To pause a script until the MIG is stable, use the wait-until command with the --stable flag. For example:

gcloud compute instance-groups managed wait-until instance-group-name \
    --stable \
    [--zone zone | --region region]
Waiting for group to become stable, current operations: deleting: 4
Waiting for group to become stable, current operations: deleting: 4
...
Group is stable

The command returns after status.isStable is set to true for the MIG.

API

For a zonal MIG, make a GET request to the instanceGroupManagers.get method:

GET https://compute.googleapis.com/compute/v1/projects/project-id/zones/zone/instanceGroupManagers/instance-group-name

For a regional managed instance group, replace zones/zone with regions/region:

GET https://compute.googleapis.com/compute/v1/projects/project-id/regions/region/instanceGroupManagers/instance-group-name

The Compute Engine API returns detailed information about the MIG including its status.isStable field.

status.isStable set to false indicates that changes to the MIG's instances are active or pending, or that the MIG itself is being modified.

status.isStable set to true indicates the following:

  • None of the instances in the MIG are undergoing any type of change and the currentAction for all instances is NONE.
  • No changes are pending for instances in the MIG.
  • The MIG itself is not being modified.

Remember that the stability of a MIG depends on numerous factors because a MIG can be modified in numerous ways. For example:

  • You make a request to roll out a new instance template.
  • You make a request to create, delete, resize or update instances in the MIG.
  • An autoscaler requests to resize the MIG.
  • An autohealer resource is replacing one or more unhealthy instances in the MIG.
  • In a regional MIG, some of the instances are being redistributed.

As soon as all actions are finished, status.isStable is set to true again for that MIG.

Viewing historical autohealing operations

You can use the gcloud tool or the API to view past autohealing events.

gcloud

Use the gcloud compute operations list command with a filter to see only the autohealing repair events in your project.

gcloud compute operations list --filter='operationType~compute.instances.repair.*'

For more information about a specific repair operation, use the describe command. For example:

gcloud compute operations describe repair-1539070348818-577c6bd6cf650-9752b3f3-1d6945e5 --zone us-east1-b

API

For regional MIGs, submit a GET request to the regionOperations resource and include a filter to scope the output list to compute.instances.repair.* events.

GET https://compute.googleapis.com/compute/v1/projects/project-id/regions/region/operations?filter=operationType+%3D+%22compute.instances.repair.*%22

For zonal MIGs, use the zoneOperations resource.

GET https://compute.googleapis.com/compute/v1/projects/project-id/zones/zone/operations?filter=operationType+%3D+%22compute.instances.repair.*%22

For more information about a specific repair operation, submit a GET request for that specific operation. For example:

GET https://compute.googleapis.com/compute/v1/projects/project-id/zones/zone/operations/repair-1539070348818-577c6bd6cf650-9752b3f3-1d6945e5

What makes a good autohealing health check

Health checks used for autohealing should be conservative so that they don't preemptively delete and recreate your instances. When an autohealer health check is too aggressive, the autohealer might mistake busy instances for failed instances and unnecessarily recreate them, reducing availability.

  • unhealthy-threshold. Set this to more than 1; ideally, 3 or more. This protects against rare transient failures such as network packet loss.
  • healthy-threshold. A value of 2 is sufficient for most apps.
  • timeout. Set this to a generous value: at least five times the expected response time. This protects against unexpected delays such as busy instances or a slow network connection.
  • check-interval. Set this value between 1 second and two times the timeout: neither too long nor too short. When the value is too long, a failed instance is not caught soon enough. When it is too short, the instances and the network can become measurably busy, given the high number of health check probes being sent.
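The unhealthy-threshold, timeout, and check-interval guidelines above can be encoded as a quick sanity check. The following Python sketch is an illustration, not an official validator:

```python
def autohealing_check_warnings(unhealthy_threshold, timeout_s,
                               check_interval_s, expected_response_s):
    """Return warnings for health check settings that look too aggressive."""
    warnings = []
    if unhealthy_threshold < 3:
        warnings.append("unhealthy-threshold below 3 may react to "
                        "transient failures such as packet loss")
    if timeout_s < 5 * expected_response_s:
        warnings.append("timeout should be at least 5x the expected "
                        "response time")
    if check_interval_s > 2 * timeout_s:
        warnings.append("check-interval is long; failed instances are "
                        "caught slowly")
    if check_interval_s < 1:
        warnings.append("check-interval under 1s adds needless probe load")
    return warnings

# A conservative configuration produces no warnings.
print(autohealing_check_warnings(3, 10, 15, 2))  # []
```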

What's next