Set up an application health check and autohealing


This document describes how to set up a health check for an application running on each VM in a managed instance group (MIG) and how to enable autohealing to repair unhealthy instances. It also describes how to check the current health state of each VM.

You can configure an application-based health check to verify that your application is responding as expected. If you configure an application-based health check and the health check determines that your application isn't responding, the MIG repairs that VM. Repairing a VM based on the application health check is called autohealing.

To learn how a MIG automatically repairs VMs, see About repairing VMs in a MIG.

Pricing

When you set up an application-based health check, by default Compute Engine writes a log entry whenever a managed instance's health state changes. Cloud Logging provides a free monthly allotment, after which logging is priced by data volume. To avoid these costs, you can disable the health state change logs.

Set up a health check and an autohealing policy

You can apply a single health check to a maximum of 50 MIGs. If you have more than 50 groups, create multiple health checks. A MIG supports only one autohealing policy, and that policy can specify only one health check.

The following example shows how to use a health check on a MIG. In this example, you create a health check that looks for a web server response on port 80. To enable the health check probes to reach each web server, you configure a firewall rule. Finally, you apply the health check to the MIG by setting the group's autohealing policy.

Console

  1. Create a health check for autohealing that is more conservative than a load balancing health check.

    For example, create a health check that looks for a response on port 80 and that can tolerate some failure before it marks VMs as UNHEALTHY and causes them to be recreated. In this example, a VM is marked as healthy if it returns successfully once. It is marked as unhealthy if it returns unsuccessfully 3 consecutive times.

    1. In the Google Cloud console, go to the Create a health check page.

      Go to Create a health check

    2. Give the health check a name, such as example-check.

    3. For Protocol, make sure that HTTP is selected.

    4. For Port, enter 80.

    5. For Check interval, enter 5.

    6. For Timeout, enter 5.

    7. Set a Healthy threshold to determine how many consecutive successful health checks must be returned before an unhealthy VM is marked as healthy. Enter 1 for this example.

    8. Set an Unhealthy threshold to determine how many consecutive unsuccessful health checks must be returned before a healthy VM is marked as unhealthy. Enter 3 for this example.

    9. Click Create to create the health check.

  2. Create a firewall rule to allow health check probes to connect to your app.

    Health check probes come from addresses in the ranges 130.211.0.0/22 and 35.191.0.0/16, so make sure your network firewall rules allow the health check to connect. For this example, our MIG uses the default network and its VMs are listening on port 80. If port 80 is not already open on the default network, create a firewall rule.

    1. In the Google Cloud console, go to the Create a firewall rule page.

      Go to Create a firewall rule

    2. For Name, enter a name for the firewall rule. For example, allow-health-check.

    3. For Network, select the default network.

    4. For Source filter, select IP ranges.

    5. For Source IP ranges, enter 130.211.0.0/22 and 35.191.0.0/16.

    6. In Protocols and ports, select Specified protocols and ports and enter tcp:80.

    7. Click Create.

  3. Apply the health check by configuring an autohealing policy for your regional or zonal MIG.

    1. In the Google Cloud console, go to the Instance groups page.

      Go to Instance groups

    2. Under the Name column of the list, click the name of the MIG where you want to apply the health check.

    3. Click Edit to modify this MIG.

    4. In the VM instance lifecycle section, under Autohealing, select the health check that you created previously.

    5. Change or keep the Initial delay setting. The initial delay is the number of seconds that a new VM takes to initialize and run its startup script. During a VM's initial delay period, the MIG ignores unsuccessful health checks because the VM might be in the startup process. This prevents the MIG from prematurely recreating a VM. If the health check receives a healthy response during the initial delay, it indicates that the startup process is complete and the VM is ready. The initial delay timer starts when the VM's currentAction field changes to VERIFYING. The value of initial delay must be between 0 and 3600 seconds. In the console, the default value is 300.

    6. Click Save to apply your changes.

gcloud

To use the command-line examples in this guide, install the Google Cloud CLI, or use Cloud Shell.

  1. Create a health check for autohealing that is more conservative than a load balancing health check.

    For example, create a health check that looks for a response on port 80 and that can tolerate some failure before it marks VMs as UNHEALTHY and causes them to be recreated. In this example, a VM is marked as healthy if it returns successfully once. It is marked as unhealthy if it returns unsuccessfully 3 consecutive times.

    gcloud compute health-checks create http example-check --port 80 \
           --check-interval 30s \
           --healthy-threshold 1 \
           --timeout 10s \
           --unhealthy-threshold 3
  2. Create a firewall rule to allow health check probes to connect to your app.

    Health check probes come from addresses in the ranges 130.211.0.0/22 and 35.191.0.0/16, so make sure your firewall rules allow the health check to connect. For this example, our MIG uses the default network, and its VMs listen on port 80. If port 80 isn't already open on the default network, create a firewall rule.

    gcloud compute firewall-rules create allow-health-check \
            --allow tcp:80 \
            --source-ranges 130.211.0.0/22,35.191.0.0/16 \
            --network default
  3. Apply the health check by configuring an autohealing policy for your regional or zonal MIG.

    Use the update command to apply the health check to the MIG.

    The initial-delay setting is the number of seconds that a new VM takes to initialize and run its startup script. During a VM's initial delay period, the MIG ignores unsuccessful health checks because the VM might be in the startup process. This prevents the MIG from prematurely recreating a VM. If the health check receives a healthy response during the initial delay, it indicates that the startup process is complete and the VM is ready. The initial delay timer starts when the VM's currentAction field changes to VERIFYING. The value of initial delay must be between 0 and 3600 seconds. The default value is 0.

    For example:

    gcloud compute instance-groups managed update my-mig \
            --health-check example-check \
            --initial-delay 300 \
            --zone us-east1-b
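
    To confirm that the policy was applied, you can read it back with the describe command. This is a sketch; the group name, zone, and --format projection below are illustrative:

    ```shell
    # Print only the group's autohealing policy (health check and initial delay)
    gcloud compute instance-groups managed describe my-mig \
        --zone us-east1-b \
        --format 'yaml(autoHealingPolicies)'
    ```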

API

To use the API examples in this guide, set up API access.

  1. Create a health check for autohealing that is more conservative than a load balancing health check.

    For example, create a health check that looks for a response on port 80 and that can tolerate some failure before it marks VMs as UNHEALTHY and causes them to be recreated. In this example, a VM is marked as healthy if it returns successfully once. It is marked as unhealthy if it returns unsuccessfully 3 consecutive times.

    POST https://compute.googleapis.com/compute/v1/projects/project-id/global/healthChecks
    
    {
     "name": "example-check",
     "type": "http",
     "port": 80,
     "checkIntervalSec": 30,
     "healthyThreshold": 1,
     "timeoutSec": 10,
     "unhealthyThreshold": 3
    }
    
  2. Create a firewall rule to allow health check probes to connect to your app.

    Health check probes come from addresses in the ranges 130.211.0.0/22 and 35.191.0.0/16, so make sure your firewall rules allow the health check to connect. For this example, our MIG uses the default network and its VMs are listening on port 80. If port 80 is not already open on the default network, create a firewall rule.

    POST https://compute.googleapis.com/compute/v1/projects/[PROJECT_ID]/global/firewalls
    
    {
     "name": "allow-health-check",
     "network": "https://www.googleapis.com/compute/v1/projects/[PROJECT_ID]/global/networks/default",
     "sourceRanges": [
      "130.211.0.0/22",
      "35.191.0.0/16"
     ],
     "allowed": [
      {
       "ports": [
        "80"
       ],
       "IPProtocol": "tcp"
      }
     ]
    }
    
  3. Apply the health check by configuring an autohealing policy for your regional or zonal MIG.

    An autohealing policy is part of an instanceGroupManager resource or regionInstanceGroupManager resource.

    You can set an autohealing policy using the insert or patch methods.

    The following example sets an autohealing policy by using the instanceGroupManagers.patch method.

    PATCH https://compute.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/instanceGroupManagers/[INSTANCE_GROUP]
    {
      "autoHealingPolicies": [
        {
          "healthCheck": "global/healthChecks/example-check",
          "initialDelaySec": 300
        }
      ]
    }
    

    The initialDelaySec setting is the number of seconds that a new VM takes to initialize and run its startup script. During a VM's initial delay period, the MIG ignores unsuccessful health checks because the VM might be in the startup process. This prevents the MIG from prematurely recreating a VM. If the health check receives a healthy response during the initial delay, it indicates that the startup process is complete and the VM is ready. The initial delay timer starts when the VM's currentAction field changes to VERIFYING. The value of initial delay must be between 0 and 3600 seconds. The default value is 0.

    To turn off application-based autohealing, set the autohealing policy to an empty value, autoHealingPolicies[]. With autoHealingPolicies[], the MIG recreates only VMs that are not in a RUNNING state.
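
    For example, the following request clears the autohealing policy on a zonal MIG. The request shape mirrors the patch example above; the bracketed values are placeholders:

    ```
    PATCH https://compute.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/instanceGroupManagers/[INSTANCE_GROUP]
    {
      "autoHealingPolicies": []
    }
    ```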

    You can get the autohealing policy of a MIG by reading the instanceGroupManagers.autoHealingPolicies field. To get a MIG resource, use the instanceGroupManagers.get or regionInstanceGroupManagers.get method.

After group creation or a health check configuration update completes, it can take up to 30 minutes before autohealing begins monitoring instances in the group. Once monitoring begins, Compute Engine marks instances as healthy or recreates them based on your autohealing configuration. For example, if you configure an initial delay of 5 minutes, a health check interval of 1 minute, and a healthy threshold of 1 check, the timeline looks like the following:

  • 30-minute delay before autohealing begins monitoring instances in the group
  • + 5 minutes for the configured initial delay
  • + 1 minute for the check interval * healthy threshold (60s * 1)
  • = 36 minutes before the instance is either marked as healthy or is recreated
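
The timeline above is simple arithmetic; as a sanity check, using the example values from this section:

```shell
# Worst-case time before a new instance is first marked healthy or recreated,
# using the example values above (all durations in seconds)
monitoring_delay=$((30 * 60))   # up to 30 minutes before monitoring begins
initial_delay=$((5 * 60))       # configured initial delay
check_interval=60               # health check interval
healthy_threshold=1             # consecutive successes required
total=$((monitoring_delay + initial_delay + check_interval * healthy_threshold))
echo "$((total / 60)) minutes"  # prints: 36 minutes
```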

Checking the status

You can verify that a VM is created and its application is responding by inspecting the current health state of each VM, by checking the current action on each VM, or by checking the group's status.

Checking whether VMs are healthy

If you have configured an application-based health check for your MIG, you can review the health state of each managed instance.

Inspect your managed instance health states to:

  • Identify unhealthy VMs that are not being autohealed. A VM might not be repaired immediately even if it has been diagnosed as unhealthy in the following situations:
    • The VM is still booting, and its initial delay has not passed.
    • A significant share of unhealthy instances is currently being autohealed. The autohealer delays further autohealing to ensure that the group keeps running a subset of instances.
  • Detect health check configuration errors. For example, you can detect misconfigured firewall rules or an invalid application health checking endpoint if the instance reports a health state of TIMEOUT.
  • Determine the initial delay value to configure by measuring the time between when the VM transitions to a RUNNING status and when it transitions to a HEALTHY health state. You can measure this gap by polling the list-instances method, or by observing the time between the instances.insert operation and the first healthy signal received.
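
As a rough way to observe that gap, you can poll the group until the first HEALTHY signal appears. This sketch assumes an authenticated gcloud environment; the group name, zone, and polling interval are placeholders:

```shell
# Poll every 5 seconds until any instance reports a HEALTHY state, then print
# the elapsed time. Start this right after the VM reaches RUNNING to
# approximate a suitable initial delay value.
start=$(date +%s)
until gcloud compute instance-groups managed list-instances my-mig \
        --zone us-east1-b | grep -qw HEALTHY; do
  sleep 5
done
echo "Seconds until first HEALTHY signal: $(( $(date +%s) - start ))"
```

The -w flag makes grep match HEALTHY as a whole word, so UNHEALTHY states don't end the loop early.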

Use the console, the gcloud CLI, or the API to view health states.

Console

  1. In the Google Cloud console, go to the Instance groups page.

    Go to Instance groups

  2. Under the Name column of the list, click the name of the MIG that you want to examine. A page opens with the instance group properties and a list of VMs that are included in the group.

  3. If a VM is unhealthy, you can see its health state in the Health check status column.

gcloud

Use the list-instances sub-command.

gcloud compute instance-groups managed list-instances instance-group
NAME              ZONE                  STATUS   HEALTH_STATE  ACTION  INSTANCE_TEMPLATE                            VERSION_NAME  LAST_ERROR
igm-with-hc-fvz6  europe-west1          RUNNING  HEALTHY       NONE    my-template
igm-with-hc-gtz3  europe-west1          RUNNING  HEALTHY       NONE    my-template

The HEALTH_STATE column shows each VM's health state.

API

For a regional MIG, construct a POST request to the listManagedInstances method:

POST https://compute.googleapis.com/compute/v1/projects/project-id/regions/region/instanceGroupManagers/instance-group/listManagedInstances

For a zonal MIG, use the zonal MIG listManagedInstances method:

POST https://compute.googleapis.com/compute/v1/projects/project-id/zones/zone/instanceGroupManagers/instance-group/listManagedInstances

The request returns a response similar to the following, which includes an instanceHealth field for each managed instance.

{
 "managedInstances": [
  {
   "instance": "https://www.googleapis.com/compute/v1/projects/project-id/zones/zone/instances/example-group-5485",
   "instanceStatus": "RUNNING",
   "currentAction": "NONE",
   "lastAttempt": {
   },
   "id": "6159431761228150698",
   "instanceTemplate": "https://www.googleapis.com/compute/v1/projects/project-id/global/instanceTemplates/example-template",
   "version": {
    "instanceTemplate": "https://www.googleapis.com/compute/v1/projects/project-id/global/instanceTemplates/example-template"
   },
   "instanceHealth": [
    {
     "healthCheck": "https://www.googleapis.com/compute/v1/projects/project-id/global/healthChecks/http-basic-check",
     "detailedHealthState": "HEALTHY"
    }
   ]
  },
  {
   "instance": "https://www.googleapis.com/compute/v1/projects/project-id/zones/zone/instances/example-group-sfdp",
   "instanceStatus": "STOPPING",
   "currentAction": "DELETING",
   "lastAttempt": {
   },
   "id": "6622324799312181783",
   "instanceHealth": [
    {
     "healthCheck": "https://www.googleapis.com/compute/v1/projects/project-id/global/healthChecks/http-basic-check",
     "detailedHealthState": "TIMEOUT"
    }
   ]
  }
 ]
}

Health states

The following VM health states are available:

  • HEALTHY: The VM is reachable, a connection to the application health checking endpoint can be established, and the response conforms to the requirements defined by the health check.
  • DRAINING: The VM is being drained. Existing connections to the VM have time to complete, but new connections are being refused.
  • UNHEALTHY: The VM is reachable, but does not conform to the requirements defined by the health check.
  • TIMEOUT: The VM is unreachable, a connection to the application health checking endpoint cannot be established, or the server on a VM does not respond within the specified timeout. For example, this may be caused by misconfigured firewall rules or an overloaded server application on a VM.
  • UNKNOWN: The health checking system is not aware of the VM or its health is not known at the moment. It can take 30 minutes for monitoring to begin on new VMs in a MIG.

New VMs return an UNHEALTHY state until they are verified by the health checking system.

Whether a VM is repaired depends on its health state:

  • If a VM has a health state of UNHEALTHY or TIMEOUT, and it has passed its initialization period, then the autohealing service immediately attempts to repair it.
  • If a VM has a health state of UNKNOWN, then it will not be repaired immediately. This is to prevent an unnecessary repair of a VM for which the health checking signal is temporarily unavailable.

Autohealing attempts can be delayed if:

  • A VM remains unhealthy after multiple consecutive repairs.
  • A significant overall share of unhealthy VMs exists in the group.

We want to learn about your use cases, challenges, or feedback about VM health state values. Please share your feedback with our team at mig-discuss@google.com.

Viewing current actions on VMs

When a MIG is creating a VM instance, it sets that instance's read-only currentAction field to CREATING. If an autohealing policy is attached to the group, then once the VM is created and running, the MIG sets the instance's current action to VERIFYING and the health checker begins to probe the VM's application. If the application passes its initial health check within the configured initial delay, the VM is verified and the MIG changes the VM's currentAction field to NONE.

Use the Google Cloud CLI or the Compute Engine API to see details about the instances in a managed instance group. Details include instance status and current actions that the group is performing on its instances.

gcloud

All managed instances

To check the status and current actions on all instances in the group, use the list-instances command.

gcloud compute instance-groups managed list-instances INSTANCE_GROUP_NAME \
    [--zone=ZONE | --region=REGION]

The command returns a list of instances in the group, including their status, current actions, and other details:

NAME               ZONE           STATUS   HEALTH_STATE  ACTION  INSTANCE_TEMPLATE  VERSION_NAME  LAST_ERROR
vm-instances-9pk4  us-central1-f                          CREATING  my-new-template
vm-instances-h2r1  us-central1-f  STOPPING                DELETING  my-old-template
vm-instances-j1h8  us-central1-f  RUNNING                 NONE      my-old-template
vm-instances-ngod  us-central1-f  RUNNING                 NONE      my-old-template

The HEALTH_STATE column appears empty unless you have set up health checking.

A specific managed instance

To check the status and current action for a specific instance in the group, use the describe-instance command.

gcloud compute instance-groups managed describe-instance INSTANCE_GROUP_NAME \
    --instance INSTANCE_NAME \
    [--zone=ZONE | --region=REGION]

The command returns details about the instance, including instance status, current action, and, for stateful MIGs, preserved state:

currentAction: NONE
id: '6789072894767812345'
instance: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-a/instances/example-mig-hz41
instanceStatus: RUNNING
name: example-mig-hz41
preservedStateFromConfig:
  metadata:
    example-key: example-value
preservedStateFromPolicy:
  disks:
    persistent-disk-0:
      autoDelete: NEVER
      mode: READ_WRITE
      source: https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-a/disks/example-mig-hz41
version:
  instanceTemplate: https://www.googleapis.com/compute/v1/projects/example-project/global/instanceTemplates/example-template

API

Call the listManagedInstances method on a regional or zonal MIG resource. For example, to see details about the instances in a zonal MIG resource, you can make the following request:

GET https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instanceGroupManagers/INSTANCE_GROUP_NAME/listManagedInstances

The call returns a list of instances for the MIG including each instance's instanceStatus and currentAction.

{
 "managedInstances": [
  {
   "instance": "https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-f/instances/vm-instances-prvp",
   "id": "5317605642920955957",
   "instanceStatus": "RUNNING",
   "instanceTemplate": "https://www.googleapis.com/compute/v1/projects/example-project/global/instanceTemplates/example-template",
   "currentAction": "REFRESHING"
  },
  {
   "instance": "https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-f/instances/vm-instances-pz5j",
   "currentAction": "DELETING"
  },
  {
   "instance": "https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-f/instances/vm-instances-w2t5",
   "id": "2800161036826218547",
   "instanceStatus": "RUNNING",
   "instanceTemplate": "https://www.googleapis.com/compute/v1/projects/example-project/global/instanceTemplates/example-template",
   "currentAction": "REFRESHING"
  }
 ]
}

To see a list of valid instanceStatus field values, see VM instance lifecycle.

If an instance is undergoing some type of change, the managed instance group sets the instance's currentAction field to one of the following actions to help you track the progress of the change. Otherwise, the currentAction field is set to NONE.

Possible currentAction values are:

  • ABANDONING. The instance is being removed from the MIG.
  • CREATING. The instance is in the process of being created.
  • CREATING_WITHOUT_RETRIES. The instance is being created without retries; if the instance isn't created on the first try, the MIG doesn't try to replace the instance again.
  • DELETING. The instance is in the process of being deleted.
  • RECREATING. The instance is being replaced.
  • REFRESHING. The instance is being removed from its current target pools and re-added to the list of current target pools (this list might be the same as or different from the existing target pools).
  • RESTARTING. The instance is in the process of being restarted using the stop and start methods.
  • VERIFYING. The instance has been created and is in the process of being verified.
  • NONE. No actions are being performed on the instance.

Checking whether the MIG is stable

At the group level, Compute Engine populates a read-only field called status that contains an isStable flag.

If all VMs in the group are running and healthy (that is, the currentAction field for each managed instance is set to NONE), then the MIG sets the status.isStable field to true. Remember that the stability of a MIG depends on group configurations beyond the autohealing policy; for example, if your group is autoscaled, and if it is currently scaling in or out, then the MIG sets the status.isStable field to false due to the autoscaler operation.

Verify that all instances in a managed instance group are running and healthy by checking the value of the group's status.isStable field.

gcloud

Use the describe command:

gcloud compute instance-groups managed describe instance-group-name \
    [--zone zone | --region region]

The gcloud CLI returns detailed information about the MIG including its status.isStable field.

To pause a script until the MIG is stable, use the wait-until command with the --stable flag. For example:

gcloud compute instance-groups managed wait-until instance-group-name \
    --stable \
    [--zone zone | --region region]
Waiting for group to become stable, current operations: deleting: 4
Waiting for group to become stable, current operations: deleting: 4
...
Group is stable

The command returns after status.isStable is set to true for the MIG.

API

For a zonal MIG, make a GET request to the instanceGroupManagers.get method:

GET https://compute.googleapis.com/compute/v1/projects/project-id/zones/zone/instanceGroupManagers/instance-group-name

For a regional managed instance group, replace zones/zone with regions/region:

GET https://compute.googleapis.com/compute/v1/projects/project-id/regions/region/instanceGroupManagers/instance-group-name

The Compute Engine API returns detailed information about the MIG including its status.isStable field.

status.isStable set to false indicates that changes are active or pending, or that the MIG itself is being modified.

status.isStable set to true indicates the following:

  • None of the instances in the MIG are undergoing any type of change and the currentAction for all instances is NONE.
  • No changes are pending for instances in the MIG.
  • The MIG itself is not being modified.

Remember that the stability of a MIG depends on numerous factors because a MIG can be modified in numerous ways. For example:

  • You make a request to roll out a new instance template.
  • You make a request to create, delete, resize, or update instances in the MIG.
  • An autoscaler requests to resize the MIG.
  • An autohealer resource is replacing one or more unhealthy instances in the MIG.
  • In a regional MIG, some of the instances are being redistributed.

As soon as all actions are finished, status.isStable is set to true again for that MIG.
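
For scripting against the API, you can read the isStable flag directly. The following sketch assumes an authenticated gcloud CLI (to obtain an access token) and that jq is installed; the project, zone, and group names are placeholders:

```shell
# Print true or false depending on whether the MIG is stable
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://compute.googleapis.com/compute/v1/projects/my-project/zones/us-east1-b/instanceGroupManagers/my-mig" \
  | jq -r '.status.isStable'
```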

Viewing historical autohealing operations

You can use the gcloud CLI or the API to view past autohealing events.

gcloud

Use the gcloud compute operations list command with a filter to see only the autohealing repair events in your project.

gcloud compute operations list --filter='operationType~compute.instances.repair.*'

For more information about a specific repair operation, use the describe command. For example:

gcloud compute operations describe repair-1539070348818-577c6bd6cf650-9752b3f3-1d6945e5 --zone us-east1-b

API

For regional MIGs, submit a GET request to the regionOperations resource and include a filter to scope the output list to compute.instances.repair.* events.

GET https://compute.googleapis.com/compute/v1/projects/project-id/regions/region/operations?filter=operationType+%3D+%22compute.instances.repair.*%22

For zonal MIGs, use the zoneOperations resource.

GET https://compute.googleapis.com/compute/v1/projects/project-id/zones/zone/operations?filter=operationType+%3D+%22compute.instances.repair.*%22

For more information about a specific repair operation, submit a GET request for that specific operation. For example:

GET https://compute.googleapis.com/compute/v1/projects/project-id/zones/zone/operations/repair-1539070348818-577c6bd6cf650-9752b3f3-1d6945e5

What makes a good autohealing health check

Health checks used for autohealing should be conservative so that they don't prematurely delete and recreate your instances. When an autohealing health check is too aggressive, the autohealer might mistake busy instances for failed ones and unnecessarily recreate them, reducing availability.

  • unhealthy-threshold. Set this value to more than 1, ideally 3 or more. This protects against rare transient failures such as a dropped network packet.
  • healthy-threshold. A value of 2 is sufficient for most apps.
  • timeout. Set this value generously, to at least five times the expected response time. This protects against unexpected delays such as busy instances or a slow network connection.
  • check-interval. Set this value between 1 second and two times the timeout, neither too long nor too short. If the interval is too long, a failed instance isn't caught soon enough. If it's too short, the instances and the network can become measurably busy, given the high number of health check probes being sent every second.
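
Putting those recommendations together, a conservative autohealing health check might look like the following. The values are examples; in particular, tune the timeout to at least five times your app's expected response time:

```shell
# Conservative health check: tolerates transient failures (3 misses),
# allows slow responses (10s timeout), and checks at a moderate cadence
gcloud compute health-checks create http conservative-check \
    --port 80 \
    --check-interval 15s \
    --timeout 10s \
    --healthy-threshold 2 \
    --unhealthy-threshold 3
```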

What's next