Setting up health checking and autohealing

Managed instance groups (MIGs) maintain high availability of your applications by proactively keeping your virtual machine (VM) instances available, which means in RUNNING state. If an instance stops RUNNING and the change of state was not initiated by the MIG (for example, a hardware failure as opposed to an autoscaler decision), then the MIG automatically recreates that instance. However, relying on an instance's state to determine application health might not be sufficient. For example, a check whether an instance is RUNNING does not detect application failures, such as freezing, overloading, or crashing.

To improve the availability of your application and to verify that your application is responding, you can configure an autohealing policy for your managed instance group (MIG).

An autohealing policy relies on an application-based health check to verify that an application is responding as expected. Checking that an application responds is more precise than simply verifying that an instance is in a RUNNING state.

If the autohealer determines that an application isn't responding, the managed instance group automatically recreates that instance. In the case of a preemptible instance, the group recreates the instance when the necessary resources become available again.

Autohealing behavior

Autohealing recreates unhealthy instances using the original instance template that was used to create the virtual machine (VM) instance (not necessarily the current instance template in the managed instance group). For example, if a VM instance was created using instance-template-a and then you update the managed instance group to use instance-template-b in OPPORTUNISTIC mode, autohealing still uses instance-template-a to recreate the instance. This is because autohealing recreations are not user-initiated so Compute Engine doesn't assume that the VM instance should use the new template. If you want to apply a new template, see Changing the instance template for a managed instance group.

At any given time, the number of concurrently autohealed instances is smaller than the managed instance group size. This ensures that the group keeps running a subset of instances even if, for example, the autohealing policy does not fit the workload, firewall rules are misconfigured, or there are network connectivity or infrastructure issues that misidentify a healthy instance as unhealthy. However, if a zonal managed instance group has only one instance, or a regional managed instance group has only one instance per zone, autohealing recreates these instances when they become unhealthy.

Autohealing doesn't recreate an instance during that instance's initialization period. For more information, see the autoHealingPolicies[].initialDelaySec property. This setting delays autohealing from checking on and potentially prematurely recreating the instance if the instance is in the process of starting up. The initial delay timer starts when the instance has a currentAction of VERIFYING.

Autohealing and disks

When recreating an instance based on its template, the autohealer handles different types of disks differently. Some disk configurations can cause the autohealer to fail when it attempts to recreate a managed instance.

Disk type                | autodelete | Behavior during an autohealing operation
New persistent disk      | true       | Disk is recreated as specified in the instance's template. Any data that was written to that disk is lost when the disk and its instance are recreated.
New persistent disk      | false      | Disk is preserved and reattached when autohealer recreates the instance.
Existing persistent disk | true       | Old disk is deleted. VM instance recreation fails because Compute Engine cannot reattach a deleted disk to the instance.
Existing persistent disk | false      | Old disk is reattached as specified in the instance's template. The data on the disk is preserved. However, for existing read/write disks, a managed instance group can have only up to one VM because a single persistent disk cannot be attached to multiple instances in read/write mode.
New local SSD            | N/A        | Disk is recreated as specified in the instance's template. The data on a local SSD is lost when an instance is recreated or deleted.
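
The table above can also be read as a simple lookup. The following Python sketch is only an illustration: the key names (such as "new-persistent") and the function names are hypothetical and are not part of any Compute Engine API.

```python
from typing import Optional

# (disk kind, autodelete) -> outcome during autohealing, transcribed from the table.
# autodelete is None for local SSDs, where the setting does not apply.
DISK_OUTCOMES = {
    ("new-persistent", True): "recreated from template; data lost",
    ("new-persistent", False): "preserved and reattached",
    ("existing-persistent", True): "deleted; instance recreation fails",
    ("existing-persistent", False): "reattached; data preserved",
    ("new-local-ssd", None): "recreated from template; data lost",
}


def autohealing_outcome(kind: str, autodelete: Optional[bool]) -> str:
    """Return the documented outcome for one disk configuration."""
    return DISK_OUTCOMES[(kind, autodelete)]


def blocks_recreation(kind: str, autodelete: Optional[bool]) -> bool:
    """True for the one configuration that makes instance recreation fail."""
    return kind == "existing-persistent" and autodelete is True


print(autohealing_outcome("existing-persistent", True))
```

A check like blocks_recreation could be run over the disks in a template before the template is used in an autohealed group.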

The autohealer does not reattach disks that are not specified in the instance's template, such as disks that you attached to a VM manually after the VM was created.

To preserve important data that was written to disk, take precautions, such as:

  • Take regular persistent disk snapshots.

  • Export data to another source, such as Cloud Storage.

If your instances have important settings that you want to preserve, Google also recommends that you use a custom image in your instance template. A custom image contains any custom settings you need. When you specify a custom image in your instance template, the managed instance group (MIG) recreates instances using the custom image that contains the custom settings you need.

Setting up a health check and an autohealing policy

You can set a maximum of one autohealing policy per managed instance group.

You can apply a single health check to a maximum of 50 managed instance groups. If you have more than 50 groups, create multiple health checks.
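
As a small planning aid, the limit of 50 groups per health check translates into the following arithmetic. This is a sketch; the function name is hypothetical.

```python
MAX_GROUPS_PER_HEALTH_CHECK = 50  # limit stated in the paragraph above


def health_checks_needed(group_count: int) -> int:
    """Smallest number of health checks that can cover group_count MIGs."""
    if group_count < 0:
        raise ValueError("group_count must not be negative")
    # Ceiling division using integers only.
    return -(-group_count // MAX_GROUPS_PER_HEALTH_CHECK)


print(health_checks_needed(120))
```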

Example health check setup

The following example shows how to use a health check on a managed instance group. In this example, you create a health check that looks for a web server response on port 80. To enable the health check probes to reach each web server, you configure a firewall rule. Finally, you apply the health check to the managed instance group by setting the group's autohealing policy.
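
The health check in this example uses a healthy threshold of 1 and an unhealthy threshold of 3. The following Python sketch models how consecutive probe results could drive that state. It is a simplified model for illustration only, not the actual implementation of the health checking system, and the class name is hypothetical.

```python
class HealthTracker:
    """Tracks consecutive probe results against the two thresholds."""

    def __init__(self, healthy_threshold: int = 1, unhealthy_threshold: int = 3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = False  # assumption: a new instance starts as not healthy
        self._successes = 0
        self._failures = 0

    def record(self, success: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if success:
            self._successes += 1
            self._failures = 0
            if self._successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self._failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy


tracker = HealthTracker()
probes = [True, False, False, True, False, False, False]
print([tracker.record(result) for result in probes])
```

In this model, two failed probes followed by a success leave the instance healthy, and only the third consecutive failure flips the state.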

Console

  1. Create a health check for autohealing that is more conservative than a load balancing health check.

    For example, create a health check that looks for a response on port 80 and that can tolerate some failure before it marks instances as UNHEALTHY and causes them to be recreated. In this example, an instance is marked as healthy if it returns successfully once. It is marked as unhealthy if it returns unsuccessfully 3 consecutive times.

    1. In the Google Cloud Console, go to the Create a health check page.

    2. Give the health check a name, such as example-check.
    3. For Protocol, make sure that HTTP is selected.
    4. For Port, enter 80.
    5. For Check interval, enter 5.
    6. For Timeout, enter 5.
    7. Set a Healthy threshold to determine how many consecutive successful health checks must be returned before an unhealthy instance is marked as healthy. Enter 1 for this example.
    8. Set an Unhealthy threshold to determine how many consecutive unsuccessful health checks must be returned before a healthy instance is marked as unhealthy. Enter 3 for this example.
    9. Click Create to create the health check.
  2. Create a firewall rule to allow health check probes to connect to your app.

    Health check probes come from addresses in the ranges 130.211.0.0/22 and 35.191.0.0/16, so make sure your network firewall rules allow the health check to connect. For this example, our managed instance group uses the default network and its instances are listening on port 80. If port 80 is not already open on the default network, create a firewall rule.

    1. In the Google Cloud Console, go to the Create a firewall rule page.

    2. For Name, enter a name for the firewall rule. For example, allow-health-check.
    3. For Network, select the default network.
    4. For Source filter, select IP ranges.
    5. For Source IP ranges, enter 130.211.0.0/22 and 35.191.0.0/16.
    6. In Protocols and ports, select Specified protocols and ports and enter tcp:80.
    7. Click Create.
  3. Apply the health check by configuring an autohealing policy for your regional or zonal managed instance group.

    1. In the Google Cloud Console, go to the Instance groups page.

    2. Under the Name column of the list, click the name of the instance group where you want to apply the health check.
    3. Click Edit group to modify this managed instance group.
    4. Under Autohealing, select the health check that you created previously.
    5. Change or keep the Initial delay setting. This setting delays autohealing from potentially prematurely recreating the instance if the instance is in the process of starting up. The initial delay timer starts when the currentAction of the instance is VERIFYING.
    6. Click Save to apply your changes.

    It can take 15 minutes before autohealing begins monitoring instances in the group.

gcloud

To use the command-line examples in this guide, install the gcloud command-line tool, or use a Cloud Shell.

  1. Create a health check for autohealing that is more conservative than a load balancing health check.

    For example, create a health check that looks for a response on port 80 and that can tolerate some failure before it marks instances as UNHEALTHY and causes them to be recreated. In this example, an instance is marked as healthy if it returns successfully once. It is marked as unhealthy if it returns unsuccessfully 3 consecutive times.

    gcloud compute health-checks create http example-check --port 80 \
           --check-interval 30s \
           --healthy-threshold 1 \
           --timeout 10s \
           --unhealthy-threshold 3
  2. Create a firewall rule to allow health check probes to connect to your app.

    Health check probes come from addresses in the ranges 130.211.0.0/22 and 35.191.0.0/16, so make sure your firewall rules allow the health check to connect. For this example, our managed instance group uses the default network, and its instances listen on port 80. If port 80 isn't already open on the default network, create a firewall rule.

    gcloud compute firewall-rules create allow-health-check \
            --allow tcp:80 \
            --source-ranges 130.211.0.0/22,35.191.0.0/16 \
            --network default
  3. Apply the health check by configuring an autohealing policy for your regional or zonal managed instance group.

    Use the update command to apply the health check to the managed instance group.

    The initial-delay setting delays autohealing from potentially prematurely recreating the instance if the instance is in the process of starting up. The initial delay timer starts when the currentAction of the instance is VERIFYING.

    For example:

    gcloud compute instance-groups managed update my-mig \
            --health-check example-check \
            --initial-delay 300 \
            --zone us-east1-b

    It can take 15 minutes before autohealing begins monitoring instances in the group.
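
When debugging firewall rules, it can help to confirm whether a source address seen in your web server logs falls inside the probe ranges listed above. The following Python sketch uses only the standard library; the function name is hypothetical.

```python
import ipaddress

# Health check probe ranges, as listed in this guide.
PROBE_RANGES = [
    ipaddress.ip_network("130.211.0.0/22"),
    ipaddress.ip_network("35.191.0.0/16"),
]


def is_health_check_probe(source_ip: str) -> bool:
    """Return True if source_ip belongs to one of the probe ranges."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in PROBE_RANGES)


print(is_health_check_probe("35.191.200.7"))
print(is_health_check_probe("10.0.0.5"))
```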

API

To use the API examples in this guide, set up API access.

  1. Create a health check for autohealing that is more conservative than a load balancing health check.

    For example, create a health check that looks for a response on port 80 and that can tolerate some failure before it marks instances as UNHEALTHY and causes them to be recreated. In this example, an instance is marked as healthy if it returns successfully once. It is marked as unhealthy if it returns unsuccessfully 3 consecutive times.

    POST https://compute.googleapis.com/compute/v1/projects/project-id/global/healthChecks
    
    {
     "name": "example-check",
     "type": "http",
     "port": 80,
     "checkIntervalSec": 30,
     "healthyThreshold": 1,
     "timeoutSec": 10,
     "unhealthyThreshold": 3
    }
    
  2. Create a firewall rule to allow health check probes to connect to your app.

    Health check probes come from addresses in the ranges 130.211.0.0/22 and 35.191.0.0/16, so make sure your firewall rules allow the health check to connect. For this example, our managed instance group uses the default network and its instances are listening on port 80. If port 80 is not already open on the default network, create a firewall rule.

    POST https://compute.googleapis.com/compute/v1/projects/[PROJECT_ID]/global/firewalls
    
    {
     "name": "allow-health-check",
     "network": "https://compute.googleapis.com/compute/v1/projects/[PROJECT_ID]/global/networks/default",
     "sourceRanges": [
      "130.211.0.0/22",
      "35.191.0.0/16"
     ],
     "allowed": [
      {
       "ports": [
        "80"
       ],
       "IPProtocol": "tcp"
      }
     ]
    }
    
  3. Apply the health check by configuring an autohealing policy for your regional or zonal managed instance group.

    An autohealing policy is part of an instanceGroupManager resource or regionInstanceGroupManager resource.

    You can set an autohealing policy using the insert or patch methods.

    The following example sets an autohealing policy by using the instanceGroupManagers.patch method.

    PATCH https://compute.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/instanceGroupManagers/[INSTANCE_GROUP]
    {
      "autoHealingPolicies": [
        {
          "healthCheck": "global/healthChecks/example-check",
          "initialDelaySec": 300
        }
      ]
    }
    

    The initialDelaySec setting delays autohealing from potentially prematurely recreating the instance if the instance is in the process of starting up. The initial delay timer starts when the currentAction of the instance is VERIFYING.

    It can take 15 minutes before autohealing begins monitoring instances in the group.

    To turn off application-based autohealing, set the autohealing policy to an empty value, autoHealingPolicies[]. With autoHealingPolicies[], the managed instance group recreates only instances that are not in a RUNNING state.

    You can get the autohealing policy of a managed instance group by reading the instanceGroupManagers.autoHealingPolicies field. To get the managed instance group resource, use the instanceGroupManagers.get method for a zonal group or the regionInstanceGroupManagers.get method for a regional group.
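
If you build the patch request body in a script, generating it with a JSON library avoids syntax slips such as trailing commas. The following Python sketch produces the body for setting or clearing the policy; the function name and its defaults are hypothetical, and sending the request (including authentication) is left out.

```python
import json
from typing import Optional


def autohealing_patch_body(health_check: Optional[str],
                           initial_delay_sec: int = 300) -> str:
    """Build the JSON body for a patch that sets or clears autohealing.

    Pass health_check=None to produce an empty autoHealingPolicies list,
    which turns off application-based autohealing.
    """
    if health_check is None:
        policies = []
    else:
        policies = [{
            "healthCheck": health_check,
            "initialDelaySec": initial_delay_sec,
        }]
    return json.dumps({"autoHealingPolicies": policies}, indent=2)


print(autohealing_patch_body("global/healthChecks/example-check"))
print(autohealing_patch_body(None))
```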

Checking the status

You can verify that an instance is created and its application is responding by inspecting the current health state of each instance, by checking the current action on each instance, or by checking the group's status.

When you first attach a health check to a managed instance group, it can take 15 minutes before monitoring begins.

Checking whether instances are healthy

If you have configured autohealing for your managed instance group, you can review the health state of each instance.

Inspect your managed instances' health states to:

  • Identify unhealthy VMs that are not being autohealed. A VM instance might not be repaired immediately even if it has been diagnosed as unhealthy in the following situations:
    • The VM is still booting, and its initial delay has not passed.
    • A significant share of unhealthy instances is currently being autohealed. The autohealer delays further autohealing to ensure that the group keeps running a subset of instances.
  • Detect health check configuration errors. For example, you can detect misconfigured firewall rules or an invalid application health checking endpoint if the instance reports a health state of TIMEOUT.
  • Determine the initial delay value to configure by measuring the amount of time between when the VM transitions to a RUNNING status and when the VM transitions to a HEALTHY health state. You can measure this gap by polling the list-instances method.
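
The measurement in the last item can be turned into a candidate initial delay as in the following Python sketch. The two timestamps are assumed to come from your own polling, and the 20 percent safety margin is an arbitrary illustrative choice, not a documented recommendation.

```python
import math
from datetime import datetime


def suggested_initial_delay(running_at: datetime, healthy_at: datetime,
                            margin_percent: int = 20) -> int:
    """Return an initial delay in seconds: observed startup time plus a margin."""
    startup_sec = math.ceil((healthy_at - running_at).total_seconds())
    if startup_sec < 0:
        raise ValueError("healthy_at must not be earlier than running_at")
    # Integer ceiling division keeps the result exact.
    return -(-startup_sec * (100 + margin_percent) // 100)


running = datetime(2024, 1, 1, 12, 0, 0)   # VM first observed as RUNNING
healthy = datetime(2024, 1, 1, 12, 4, 0)   # VM first observed as HEALTHY
print(suggested_initial_delay(running, healthy))
```

Repeating the measurement on several instances and taking the largest value gives a more conservative setting.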

Use the console, the gcloud command-line tool, or the API to view health states.

Console

  1. In the Google Cloud Console, go to the Instance groups page.

  2. Under the Name column of the list, click the name of the instance group that you want to examine. A page opens with the instance group properties and a list of instances that are included in the group.

  3. If an instance is unhealthy, you can see its health state in the Health issues column.

gcloud

Use the list-instances sub-command.

gcloud beta compute instance-groups managed list-instances instance-group
NAME              ZONE                  STATUS   HEALTH_STATE  ACTION  INSTANCE_TEMPLATE                            VERSION_NAME  LAST_ERROR
igm-with-hc-fvz6  europe-west1          RUNNING  HEALTHY       NONE    my-template
igm-with-hc-gtz3  europe-west1          RUNNING  HEALTHY       NONE    my-template

The HEALTH_STATE column shows each instance's health state.

API

For a regional managed instance group, construct a POST request to the listManagedInstances method:

POST https://compute.googleapis.com/compute/beta/projects/project-id/regions/region/instanceGroupManagers/instance-group/listManagedInstances

For a zonal managed instance group, use the zonal managed instance group listManagedInstances method:

POST https://compute.googleapis.com/compute/beta/projects/project-id/zones/zone/instanceGroupManagers/instance-group/listManagedInstances

The request returns a response similar to the following, which includes an instanceHealth field for each managed instance.

{
 "managedInstances": [
  {
   "instance": "https://compute.googleapis.com/compute/beta/projects/project-id/zones/zone/instances/example-group-5485",
   "instanceStatus": "RUNNING",
   "currentAction": "NONE",
   "lastAttempt": {
   },
   "id": "6159431761228150698",
   "instanceTemplate": "https://compute.googleapis.com/compute/beta/projects/project-id/global/instanceTemplates/example-template",
   "version": {
    "instanceTemplate": "https://compute.googleapis.com/compute/beta/projects/project-id/global/instanceTemplates/example-template"
   },
   "instanceHealth": [
    {
     "healthCheck": "https://compute.googleapis.com/compute/beta/projects/project-id/global/healthChecks/http-basic-check",
     "detailedHealthState": "HEALTHY"
    }
   ]
  },
  {
   "instance": "https://compute.googleapis.com/compute/beta/projects/project-id/zones/zone/instances/example-group-sfdp",
   "instanceStatus": "STOPPING",
   "currentAction": "DELETING",
   "lastAttempt": {
   },
   "id": "6622324799312181783",
   "instanceHealth": [
    {
     "healthCheck": "https://compute.googleapis.com/compute/beta/projects/project-id/global/healthChecks/http-basic-check",
     "detailedHealthState": "TIMEOUT"
    }
   ]
  }
 ]
}
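
A script that consumes this response might reduce it to one health state per instance. The following Python sketch assumes the response shape shown above and reads only the first instanceHealth entry; the function name is hypothetical, and instances without health information map to None.

```python
import json
from typing import Dict, Optional


def health_by_instance(response_text: str) -> Dict[str, Optional[str]]:
    """Map each instance name to its first reported detailedHealthState."""
    health = {}
    for managed in json.loads(response_text).get("managedInstances", []):
        # The instance field is a URL; the last path segment is the name.
        name = managed["instance"].rsplit("/", 1)[-1]
        checks = managed.get("instanceHealth", [])
        health[name] = checks[0].get("detailedHealthState") if checks else None
    return health


# Trimmed copy of the sample response above.
SAMPLE = json.dumps({
    "managedInstances": [
        {
            "instance": "https://compute.googleapis.com/compute/beta/projects/project-id/zones/zone/instances/example-group-5485",
            "instanceHealth": [{"detailedHealthState": "HEALTHY"}],
        },
        {
            "instance": "https://compute.googleapis.com/compute/beta/projects/project-id/zones/zone/instances/example-group-sfdp",
            "instanceHealth": [{"detailedHealthState": "TIMEOUT"}],
        },
    ]
})

print(health_by_instance(SAMPLE))
```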

Health states

The following instance health states are available:

  • HEALTHY: The instance is reachable, a connection to the application health checking endpoint can be established, and the response conforms to the requirements defined by the health check.
  • DRAINING: The instance is being drained. Existing connections to the instance have time to complete, but new connections are being refused.
  • UNHEALTHY: The instance is reachable, but does not conform to the requirements defined by the health check.
  • TIMEOUT: The instance is unreachable: a connection to the application health checking endpoint cannot be established, or the server on a VM instance does not respond within the specified timeout. For example, this may be caused by misconfigured firewall rules or an overloaded server application on a VM instance.
  • UNKNOWN: The health checking system is not aware of the instance or its health is not known at the moment. It can take 15 minutes for monitoring to begin on new instances in a MIG.

New instances will return an UNHEALTHY state until they are verified by the health checking system.

Whether an instance is repaired depends on its health state:

  • If an instance has a health state of UNHEALTHY or TIMEOUT, and it has passed its initialization period, then the autohealing service immediately attempts to repair it.
  • If an instance has a health state of UNKNOWN, then it will not be repaired immediately. This is to prevent an unnecessary repair of an instance for which the health-checking signal is temporarily unavailable.

Autohealing attempts can be delayed if:

  • An instance remains unhealthy after multiple consecutive repairs.
  • A significant overall share of unhealthy instances exists in the group.
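
The basic repair rule above can be summarized in a few lines. This Python sketch covers only the health state and the initialization period; it deliberately does not model the delays for repeated repairs or for a large share of unhealthy instances, and the function name is hypothetical.

```python
REPAIRABLE_STATES = {"UNHEALTHY", "TIMEOUT"}


def is_repair_candidate(health_state: str, past_initial_delay: bool) -> bool:
    """True if the instance qualifies for an immediate repair attempt."""
    return health_state in REPAIRABLE_STATES and past_initial_delay


print(is_repair_candidate("TIMEOUT", past_initial_delay=True))
print(is_repair_candidate("UNKNOWN", past_initial_delay=True))
```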

We want to learn about your use cases, challenges, or feedback about VM instance health state values. Please share your feedback with our team at mig-discuss@google.com.

Viewing current actions on instances

When a managed instance is in the process of being created, its currentAction is CREATING. If an autohealing policy is attached to the group, once the managed instance is created and running, the instance proceeds to a currentAction of VERIFYING and the health checker begins to probe the instance's application. If the application passes this initial health check within the time that it takes for the application to start, then the instance is verified and its currentAction flips to NONE.

You can see the currentAction being performed and the status of each instance in a managed instance group with the gcloud command-line tool or the API.

gcloud

gcloud compute instance-groups managed list-instances instance-group-name \
[--filter="zone:(zone)" | --filter="region:(region)"]

gcloud returns a list of instances in the instance group and their respective statuses and current actions. For example:

NAME               ZONE           STATUS    ACTION    INSTANCE_TEMPLATE  VERSION_NAME  LAST_ERROR
vm-instances-9pk4  us-central1-f            CREATING  my-new-template
vm-instances-h2r1  us-central1-f  STOPPING  DELETING  my-old-template
vm-instances-j1h8  us-central1-f  RUNNING   NONE      my-old-template
vm-instances-ngod  us-central1-f  RUNNING   NONE      my-old-template

API

In the API, make a POST request to the regionInstanceGroupManagers.listManagedInstances method. For a zonal managed instance group, use the instanceGroupManagers.listManagedInstances method.

POST https://compute.googleapis.com/compute/v1/projects/project-id/regions/region/instanceGroupManagers/instance-group-name/listManagedInstances

The API returns a list of instances for the group including each instance's instanceStatus and currentAction.

{
 "managedInstances": [
  {
   "instance": "https://compute.googleapis.com/compute/v1/projects/project-id/zones/zone/instances/vm-instances-prvp",
   "id": "5317605642920955957",
   "instanceStatus": "RUNNING",
   "instanceTemplate": "https://compute.googleapis.com/compute/v1/projects/project-id/global/instanceTemplates/instance-template-name",
   "currentAction": "REFRESHING"
  },
  {
   "instance": "https://compute.googleapis.com/compute/v1/projects/project-id/zones/zone/instances/vm-instances-pz5j",
   "currentAction": "DELETING"
  },
  {
   "instance": "https://compute.googleapis.com/compute/v1/projects/project-id/zones/zone/instances/vm-instances-w2t5",
   "id": "2800161036826218547",
   "instanceStatus": "RUNNING",
   "instanceTemplate": "https://compute.googleapis.com/compute/v1/projects/project-id/global/instanceTemplates/instance-template-name",
   "currentAction": "REFRESHING"
  }
 ]
}

For each instance in a managed instance group, the status of the instance is described by its instanceStatus field. To see a list of valid instanceStatus field values, see Checking an instance's status.

If the instance is undergoing some type of change, the currentAction field is populated with one of the following actions to help you track the progress of the change. Otherwise, the currentAction field is NONE.

Possible currentAction values are:

  • ABANDONING. The instance is being removed from the managed instance group.
  • CREATING. The instance is in the process of being created.
  • CREATING_WITHOUT_RETRIES. The instance is being created without retries; if the instance isn't created on the first try, the managed instance group doesn't try to replace the instance again.
  • DELETING. The instance is in the process of being deleted.
  • RECREATING. The instance was deleted and is being replaced.
  • REFRESHING. The instance is being removed from its current target pools and being readded to the list of current target pools (this list might be the same or different from existing target pools).
  • RESTARTING. The instance is in the process of being restarted using the stop and start methods.
  • VERIFYING. The instance has been created and is in the process of being verified.
  • NONE. No actions are being performed on the instance.
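
To get a quick overview of a group, you can tally the currentAction values from a listManagedInstances response. The following Python sketch assumes the list of managedInstances has already been parsed from JSON; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def summarize_actions(managed_instances: List[Dict]) -> Counter:
    """Count how many instances are performing each currentAction."""
    return Counter(m["currentAction"] for m in managed_instances)


def all_idle(managed_instances: List[Dict]) -> bool:
    """True if every instance reports a currentAction of NONE."""
    return all(m["currentAction"] == "NONE" for m in managed_instances)


instances = [
    {"currentAction": "REFRESHING"},
    {"currentAction": "DELETING"},
    {"currentAction": "REFRESHING"},
]
print(summarize_actions(instances))
print(all_idle(instances))
```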

Checking whether the MIG is stable

At the group level, Compute Engine populates a read-only field called status that contains an isStable flag.

If all instances in the group are running and healthy (that is, the currentAction of each managed instance is NONE), then status.isStable==true. Remember that the stability of a managed instance group depends on group configurations beyond the autohealing policy; for example, if your group is autoscaled, and if it is currently scaling up, then isStable==false due to the autoscaler operation.

You can verify that a managed instance group is running and healthy by checking the value of the status.isStable field.

gcloud

Use the instance group describe command:

gcloud compute instance-groups managed describe instance-group-name \
    [--zone zone | --region region]

The gcloud tool returns detailed information about the instance group, including the status.isStable field.

To pause a script until the group is stable, use the wait-until command with the --stable flag. For example:

gcloud beta compute instance-groups managed wait-until instance-group-name \
    --stable \
    [--zone zone | --region region]
Waiting for group to become stable, current operations: deleting: 4
Waiting for group to become stable, current operations: deleting: 4
...
Group is stable

The command returns after status.isStable is set to true for the group.

API

For a zonal MIG, make a GET request to the instanceGroupManagers.get method:

GET https://compute.googleapis.com/compute/v1/projects/project-id/zones/zone/instanceGroupManagers/instance-group-name

For a regional managed instance group, use the regionInstanceGroupManagers.get method and replace zones/zone with regions/region:

GET https://compute.googleapis.com/compute/v1/projects/project-id/regions/region/instanceGroupManagers/instance-group-name

The API returns detailed information about the instance group including the status.isStable field.

status.isStable set to false indicates that changes are active or pending, or that the managed instance group itself is being modified.

status.isStable set to true indicates the following:

  • None of the instances in the managed instance group are undergoing any type of change and the currentAction for all instances is NONE.
  • No changes are pending for instances in the managed instance group.
  • The managed instance group itself is not being modified.

Managed instance groups can be modified in numerous ways. For example:

  • You make a request to roll out a new instance template.
  • You make a request to create, delete, resize or update instances in the group.
  • An autoscaler requests to resize the group.
  • An autohealer resource is replacing one or more unhealthy instances in the managed instance group.
  • In a regional managed instance group, some of the instances are being redistributed.

As soon as all actions are finished, status.isStable is set to true again for that managed instance group.
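
If you poll the API instead of using the gcloud tool, a wait loop on status.isStable might look like the following Python sketch. The fetch_group callable is a placeholder for whatever code retrieves the instance group resource as a dictionary; it, the function name, and the default timings are assumptions for illustration.

```python
import time
from typing import Callable, Dict


def wait_until_stable(fetch_group: Callable[[], Dict],
                      timeout_sec: float = 600,
                      poll_sec: float = 10,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until status.isStable is true; return False on timeout."""
    deadline = time.monotonic() + timeout_sec
    while True:
        group = fetch_group()
        if group.get("status", {}).get("isStable") is True:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_sec)
```

Passing sleep as a parameter makes the loop easy to exercise in tests without real waiting.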

Viewing historical autohealing operations

You can use the gcloud tool or the API to view past autohealing events.

gcloud

Use the gcloud compute operations list command with a filter to see only the autohealing repair events in your project.

gcloud compute operations list --filter='operationType~compute.instances.repair.*'

For more information about a specific repair operation, use the describe command. For example:

gcloud compute operations describe repair-1539070348818-577c6bd6cf650-9752b3f3-1d6945e5 --zone us-east1-b

API

For regional managed instance groups, submit a GET request to the regionOperations resource and include a filter to scope the output list to compute.instances.repair.* events.

GET https://compute.googleapis.com/compute/v1/projects/project-id/regions/region/operations?filter=operationType+%3D+%22compute.instances.repair.*%22

For zonal managed instance groups, use the zoneOperations resource.

GET https://compute.googleapis.com/compute/v1/projects/project-id/zones/zone/operations?filter=operationType+%3D+%22compute.instances.repair.*%22

For more information about a specific repair operation, submit a GET request for that specific operation. For example:

GET https://compute.googleapis.com/compute/v1/projects/project-id/zones/zone/operations/repair-1539070348818-577c6bd6cf650-9752b3f3-1d6945e5
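
The filter in the list requests above is a URL-encoded form of operationType = "compute.instances.repair.*". The following Python sketch builds those URLs with the standard library so that the encoding does not have to be written by hand; the function name is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://compute.googleapis.com/compute/v1"
REPAIR_FILTER = 'operationType = "compute.instances.repair.*"'


def repair_operations_url(project: str, scope: str, location: str) -> str:
    """Build the operations list URL for a zone or a region.

    scope must be "zones" or "regions"; location is the zone or region name.
    """
    if scope not in ("zones", "regions"):
        raise ValueError("scope must be 'zones' or 'regions'")
    # safe="*" keeps the wildcard unescaped, matching the examples above.
    query = urlencode({"filter": REPAIR_FILTER}, safe="*")
    return f"{BASE_URL}/projects/{project}/{scope}/{location}/operations?{query}"


print(repair_operations_url("project-id", "zones", "zone"))
```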

What makes a good autohealing health check

Health checks used for autohealing should be conservative so they don't preemptively delete and recreate your instances. When an autohealer health check is too aggressive, the autohealer might mistake busy instances for failed instances and unnecessarily restart them, reducing availability.

  • unhealthy-threshold. Should be more than 1. Ideally, set this value to 3 or more. This protects against rare failures like a network packet loss.
  • healthy-threshold. A value of 2 is sufficient for most apps.
  • timeout. Set this time value to a generous amount (five times or more than the expected response time). This protects against unexpected delays like busy instances or a slow network connection.
  • check-interval. This value should be between 1 second and two times the timeout (not too long nor too short). When a value is too long, a failed instance is not caught soon enough. When a value is too short, the instances and the network can become measurably busy, given the high number of health check probes being sent every second.
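
The guidelines above can be expressed as a simple checklist for a proposed configuration. The following Python sketch is an informal reading of those guidelines, not an official validator; the function name and parameter names are hypothetical, and all time values are in seconds.

```python
from typing import List


def review_health_check(check_interval: float, timeout: float,
                        unhealthy_threshold: int,
                        expected_response: float) -> List[str]:
    """Return a list of warnings; an empty list means no guideline is violated."""
    warnings = []
    if unhealthy_threshold <= 1:
        warnings.append("unhealthy-threshold should be more than 1, ideally 3 or more")
    if timeout < 5 * expected_response:
        warnings.append("timeout should be at least five times the expected response time")
    if not 1 <= check_interval <= 2 * timeout:
        warnings.append("check-interval should be between 1 second and two times the timeout")
    return warnings


print(review_health_check(check_interval=10, timeout=5,
                          unhealthy_threshold=3, expected_response=1))
print(review_health_check(check_interval=30, timeout=5,
                          unhealthy_threshold=1, expected_response=2))
```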
