Using autohealing for highly available applications

This interactive tutorial shows how to use autohealing to build highly available applications on Compute Engine.

Highly available applications are designed to serve clients with minimal latency and downtime. Availability is compromised when an application crashes or freezes. Clients of a compromised application may experience high latency or downtime.

Autohealing allows you to automatically restart applications that are compromised. It promptly detects failed instances and recreates them automatically, so clients can be served again. With autohealing, you no longer need to manually bring an application back to service after a failure.

Objectives

  • Configure a health check and an autohealing policy.
  • Set up a demo web service on a managed instance group.
  • Simulate health check failures and witness the autohealing recovery process.

Costs

This tutorial uses billable components of GCP including:

  • Compute Engine

Before you begin

  1. Sign in to your Google Account. If you don't already have one, sign up for a new account.
  2. Select or create a GCP project on the project selector page.
  3. Make sure that billing is enabled for your Google Cloud Platform project.
  4. Enable the Compute Engine API.

If you prefer to work from the command line, install the gcloud command-line tool.

Application architecture

The application includes the following Compute Engine components:

  • Health check: An HTTP health check policy used by the autohealer to detect failed VM instances.
  • Firewall rules: GCP firewall rules let you allow or deny traffic to your instances.
  • Managed instance group: A group of instances running the same demo web service.
  • Instance template: A template used to create each instance in the instance group.

System architecture diagram showing a health check and an instance group

How the health check probes the demo web service

A health check sends probe requests to an instance using a specified protocol, such as HTTP(S), SSL, or TCP. For more information, see the documentation on how health checks work and health check categories, protocols, and ports.

The health check in this tutorial is an HTTP health check probing the HTTP path /health on port 80. For an HTTP health check, the probe request passes only if the path returns an HTTP 200 (OK) response. For this tutorial, the demo web server defines the path /health to return an HTTP 200 (OK) response when healthy or an HTTP 500 (Internal Server Error) response when unhealthy. For more information, see the documentation on success criteria for HTTP, HTTPS, and HTTP/2.
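
The pass/fail rule described above can be sketched as a small shell function. This is only an illustration of the success criterion, not the prober's actual implementation. The function name probe_result is hypothetical, and it takes the first line of an HTTP response (for example, from curl --head) as its argument.

```shell
# Classify a probe by the status line of the HTTP response.
# Per the rule above, only a 200 response counts as a pass.
# probe_result is a hypothetical helper name, not part of gcloud.
probe_result() (
    # The status code is the second field, e.g. "HTTP/1.1 200 OK".
    code=$(printf '%s\n' "$1" | awk '{ print $2 }')
    if [ "$code" = "200" ]; then
        echo "pass"
    else
        echo "fail"
    fi
)

probe_result "HTTP/1.1 200 OK"
probe_result "HTTP/1.1 500 INTERNAL SERVER ERROR"
```

The first call prints pass and the second prints fail, which mirrors the two responses the demo web server can return from /health.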

Create the health check

To set up autohealing, create a custom health check and configure the network firewall to allow health check probes.

Console

  1. Create a health check.

    1. Go to the Health checks page in the GCP Console.
      Go to the Health checks page
    2. Click Create health check.
    3. Set Name to autohealer-check.
    4. For Protocol, select HTTP.
    5. Set Request path to /health. This indicates what HTTP path the health check uses. For this tutorial, the demo web server defines the path /health to return either an HTTP 200 (OK) response when healthy or an HTTP 500 (Internal Server Error) response when unhealthy.
    6. Set the Health criteria:
      1. Set Check interval to 10. This defines the amount of time from the start of one probe to the start of the next one.
      2. Set Timeout to 5. This defines the amount of time that GCP will wait for a response to a probe. Its value must be less than or equal to the check interval.
      3. Set Healthy threshold to 2. This defines the number of sequential probes that must succeed in order for the instance to be considered healthy.
      4. Set Unhealthy threshold to 3. This defines the number of sequential probes that must fail in order for the instance to be considered unhealthy.
    7. Click Create at the bottom.
  2. Create a firewall rule to allow health check probes to make HTTP requests.

    1. Go to the Create firewall rule page in the GCP Console.
      Go to the Create firewall rule page
    2. For Name, enter default-allow-http-health-check
    3. For Network, select default
    4. For Targets, select All instances in the network
    5. For Source filter, select IP ranges
    6. For Source IP ranges, enter 130.211.0.0/22 and 35.191.0.0/16
    7. In Protocols and ports, select tcp and enter 80
    8. Click Create.

gcloud

  1. Create a health check.

    gcloud compute health-checks create http autohealer-check \
        --check-interval 10 \
        --timeout 5 \
        --healthy-threshold 2 \
        --unhealthy-threshold 3 \
        --request-path "/health"
    
    • check-interval defines the amount of time from the start of one probe to the start of the next one.
    • timeout defines the amount of time that GCP will wait for a response to a probe. Its value must be less than or equal to the check interval.
    • healthy-threshold defines the number of sequential probes that must succeed in order for the instance to be considered healthy.
    • unhealthy-threshold defines the number of sequential probes that must fail in order for the instance to be considered unhealthy.
    • request-path indicates what HTTP path the health check uses. For this tutorial, the demo web server defines the path /health to return either an HTTP 200 (OK) response when healthy or an HTTP 500 (Internal Server Error) response when unhealthy.
  2. Create a firewall rule to allow health check probes to make HTTP requests.

    gcloud compute firewall-rules create default-allow-http-health-check \
        --network default \
        --allow tcp:80 \
        --source-ranges 130.211.0.0/22,35.191.0.0/16
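
As a rough illustration of how the health check values used above work together, the sketch below estimates how long a failed instance might go unnoticed. It assumes a simplified model: the instance is marked unhealthy after unhealthy-threshold consecutive failed probes, up to one check interval passes before each of those probes starts, and the last probe may wait for the full timeout. Actual timing depends on how GCP schedules probes, so treat the result as an approximation. The function name max_detection_seconds is hypothetical.

```shell
# Rough upper bound, in seconds, on the time between a failure and the
# instance being marked unhealthy, under the simplified model described
# in the text (an assumption, not documented GCP behavior).
max_detection_seconds() (
    interval=$1
    timeout=$2
    unhealthy_threshold=$3
    echo $(( interval * unhealthy_threshold + timeout ))
)

# Values used in this tutorial: interval 10, timeout 5, threshold 3.
max_detection_seconds 10 5 3
```

With the tutorial's values, the estimate is 35 seconds.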
    

What makes a good autohealing health check

Health checks used for autohealing should be conservative so that they do not prematurely delete and recreate your instances. When an autohealer health check is too aggressive, the autohealer may mistake busy instances for failed instances and unnecessarily restart them, reducing availability.

  • unhealthy-threshold: should be more than 1, ideally 3 or more. This protects against rare failures like a network packet loss.
  • healthy-threshold: a value of 2 is good enough for most applications.
  • timeout: should be much longer (5x or more) than the expected response time. This protects against unexpected delays such as busy instances or a slow network connection.
  • check-interval: should be between 1 second and two times the timeout. If the interval is too long, a failed instance is not caught soon enough. If it is too short, the high number of health check probes sent every second can make the instances and the network measurably busy.
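
These guidelines can be turned into a quick sanity check before you create a health check. The following is a minimal sketch, assuming whole-number values in seconds and an expected response time that you have measured yourself. The function name check_autohealing_settings and its warning messages are hypothetical.

```shell
# Compare health check settings against the guidelines above.
# Prints one warning per guideline that is not met, and nothing if all
# are met. check_autohealing_settings is a hypothetical helper.
check_autohealing_settings() (
    interval=$1            # check-interval, in seconds
    timeout=$2             # timeout, in seconds
    unhealthy_threshold=$3
    expected_response=$4   # expected response time of the path, in seconds

    if [ "$unhealthy_threshold" -lt 3 ]; then
        echo "unhealthy-threshold is below the suggested 3"
    fi
    if [ "$timeout" -lt $(( expected_response * 5 )) ]; then
        echo "timeout is less than 5x the expected response time"
    fi
    if [ "$timeout" -gt "$interval" ]; then
        echo "timeout is longer than check-interval"
    fi
    if [ "$interval" -lt 1 ] || [ "$interval" -gt $(( timeout * 2 )) ]; then
        echo "check-interval is outside 1 second to 2x the timeout"
    fi
)

# The tutorial's values, with a hypothetical 1-second response time.
check_autohealing_settings 10 5 3 1
```

The tutorial's values produce no warnings. A call such as check_autohealing_settings 30 5 1 1 would report both the low unhealthy threshold and the overly long check interval.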

Set up the web service

This tutorial uses a web application that is stored on GitHub. If you would like to learn more about how the application was implemented, see the Google Cloud Platform GitHub repository.

To set up the demo web service, create an instance template that launches the demo web server on startup. Then, use this instance template to deploy a managed instance group and enable autohealing.

Console

  1. Create an instance template. Include a startup script that starts up the demo web server.

    1. Go to the Instance templates page in the GCP Console.
      Go to the Instance templates page
    2. Click Create instance template.
    3. Set the Name to webserver-template
    4. For Machine configuration select micro (f1-micro).
    5. Under Firewall, select the Allow HTTP traffic checkbox.
    6. Click Management, security, disks, networking, sole tenancy to reveal advanced settings. You should see a number of tabs.
    7. Under the Management tab, find Automation and enter the following Startup script:
      sudo apt-get update && sudo apt-get install git gunicorn3 python3-pip -y
      git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
      cd python-docs-samples/compute/managed-instances/demo
      sudo pip3 install -r requirements.txt
      sudo gunicorn3 --bind 0.0.0.0:80 app:app --daemon
      
    8. Click Create at the bottom of the page.
  2. Deploy the web server as a managed instance group.

    1. Go to the Instance groups page in the GCP Console.
      Go to the Instance groups page
    2. Click Create instance group.
    3. Set the Name to webserver-group
    4. For Region select europe-west1
    5. For Zone select europe-west1-b
    6. For Instance template select webserver-template
    7. For Autoscaling select Off.
    8. Set Number of instances to 3
    9. For Health check select autohealer-check
    10. Set Initial delay to 90
    11. Click Create.
  3. Create a firewall rule that will allow HTTP requests to the web servers.

    1. Go to the Create firewall rule page in the GCP Console.
      Go to the Create firewall rule page
    2. For Name, enter default-allow-http
    3. For Network, select default
    4. For Targets, select Specified target tags
    5. For Target Tags, enter http-server
    6. For Source filter, select IP ranges
    7. For Source IP ranges, enter 0.0.0.0/0
    8. In Protocols and ports, select tcp and enter 80
    9. Click Create.

gcloud

  1. Create an instance template. Include a startup script that starts up the demo web server.

    gcloud compute instance-templates create webserver-template \
        --machine-type f1-micro \
        --tags http-server \
        --metadata startup-script='
      sudo apt-get update && sudo apt-get install git gunicorn3 python3-pip -y
      git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
      cd python-docs-samples/compute/managed-instances/demo
      sudo pip3 install -r requirements.txt
      sudo gunicorn3 --bind 0.0.0.0:80 app:app --daemon'
    
  2. Create an instance group.

    gcloud compute instance-groups managed create webserver-group \
        --zone europe-west1-b \
        --template webserver-template \
        --size 3 \
        --health-check autohealer-check \
        --initial-delay 90
    
  3. Create a firewall rule that will allow HTTP requests to the web servers.

    gcloud compute firewall-rules create default-allow-http \
        --network default \
        --allow tcp:80 \
        --target-tags http-server
    

Simulate health check failures

To simulate health check failures, the demo web server provides ways for you to force a health check failure.

Console

  1. Navigate to a web server instance.

    1. Go to the VM instances page in the GCP Console.
      Go to the VM instances page
    2. Under the External IP column, click the IP address for any webserver-group instance. A new tab should open in your web browser. If the request times out or the web page is not available, wait a minute to let the server finish setting up and try again.

    The demo web server displays a page similar to the following:

    Simple demo web page showing green status buttons and blue action buttons

  2. On the demo web page, click Make unhealthy.

    This causes the web server to fail the health check. Specifically, the web server makes the /health path return an HTTP 500 (Internal Server Error). You can verify this yourself by quickly clicking the Check health button (this will stop working after the autohealer has started rebooting the instance).

  3. Wait for the autohealer to take action.

    1. Go to the VM instances page in the GCP Console.
      Go to the VM instances page
    2. Wait for the status of the web server instance to change. The green checkmark next to the instance name should change to a grey square, indicating the autohealer has started rebooting the unhealthy instance.
    3. Click Refresh at the top of the page periodically to get the most recent status.
    4. The autohealing process is finished when the grey square changes back to a green checkmark, indicating the instance is healthy again.

gcloud

  1. Monitor the status of the instance group. Use Ctrl-C to stop when finished.

    while : ; do \
        gcloud compute instance-groups managed list-instances webserver-group \
        --zone europe-west1-b \
        ; done
    
    NAME                  ZONE            STATUS   ACTION  INSTANCE_TEMPLATE   VERSION_NAME  LAST_ERROR
    webserver-group-d5tz  europe-west1-b  RUNNING  NONE    webserver-template
    webserver-group-q6t9  europe-west1-b  RUNNING  NONE    webserver-template
    webserver-group-tbpj  europe-west1-b  RUNNING  NONE    webserver-template
    

    If any instances show a status that is not RUNNING, such as STAGING, wait a minute to let the instance finish setting up and try again.

  2. Open a new shell session with gcloud installed.

  3. Get the address of a web server instance.

    gcloud compute instances list --filter webserver-group
    

    Under the EXTERNAL_IP column, copy the IP address of any web server instance and save it as a local bash variable.

    export IP_ADDRESS=EXTERNAL_IP_ADDRESS
    
  4. Verify the web server has finished setting up. The server should return an HTTP 200 OK response.

    curl --head $IP_ADDRESS/health
    
    HTTP/1.1 200 OK
    Server: gunicorn/19.6.0
    ...
    

    If you get a Connection refused error, wait a minute to let the server finish setting up and try again.

  5. Make the web server unhealthy.

    curl $IP_ADDRESS/makeUnhealthy > /dev/null
    

    This causes the web server to fail the health check. Specifically, the web server makes the /health path return an HTTP 500 INTERNAL SERVER ERROR. You can verify this yourself by quickly making a request to /health (this will stop working after the autohealer has started rebooting the instance).

    curl --head $IP_ADDRESS/health
    
    HTTP/1.1 500 INTERNAL SERVER ERROR
    Server: gunicorn/19.6.0
    ...
    
  6. Return to your first shell session to monitor the instance group and wait for the autohealer to take action.

    1. When the autohealing process has started, the STATUS and ACTION columns will update, indicating the autohealer has started rebooting the unhealthy instance.

      NAME                  ZONE            STATUS    ACTION      INSTANCE_TEMPLATE   VERSION_NAME  LAST_ERROR
      webserver-group-d5tz  europe-west1-b  RUNNING   NONE        webserver-template
      webserver-group-q6t9  europe-west1-b  RUNNING   NONE        webserver-template
      webserver-group-tbpj  europe-west1-b  STOPPING  RECREATING  webserver-template
      
    2. The autohealing process has finished when the instance once again reports a STATUS of RUNNING and an ACTION of NONE, indicating the instance has been successfully restarted.

      NAME                  ZONE            STATUS   ACTION  INSTANCE_TEMPLATE   VERSION_NAME  LAST_ERROR
      webserver-group-d5tz  europe-west1-b  RUNNING  NONE    webserver-template
      webserver-group-q6t9  europe-west1-b  RUNNING  NONE    webserver-template
      webserver-group-tbpj  europe-west1-b  RUNNING  NONE    webserver-template
      
    3. When finished monitoring the instance group, use Ctrl-C to quit.
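
Instead of watching the output by eye, the check for a finished recovery can be scripted. The sketch below is one possible approach, assuming the column layout shown in the sample output above (STATUS in the third column, ACTION in the fourth). The function names group_is_stable and wait_until_stable are hypothetical helpers, not gcloud commands.

```shell
# Read "list-instances" output on standard input. Succeed only if there
# is at least one instance and every instance reports STATUS RUNNING
# and ACTION NONE. Assumes the column order shown in the sample output.
group_is_stable() {
    awk '$1 != "NAME" && NF > 0 {
             rows++
             if ($3 != "RUNNING" || $4 != "NONE") unstable = 1
         }
         END { exit ((rows > 0 && !unstable) ? 0 : 1) }'
}

# Poll the tutorial's instance group every 5 seconds until it is stable.
# Defined here but not called, so that loading this file has no effect.
wait_until_stable() {
    until gcloud compute instance-groups managed list-instances webserver-group \
            --zone europe-west1-b | group_is_stable; do
        sleep 5
    done
}
```

Calling wait_until_stable returns once every instance is RUNNING with no pending action. It also returns immediately if the autohealer has not yet started recreating the instance, so call it only after the STATUS or ACTION column has changed.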

Feel free to repeat this exercise. Here are some ideas:

  • What happens if you make all instances unhealthy at once? For more information about autohealing behavior during concurrent failures, see the documentation on autohealing behavior.

  • Can you update the health check configuration to heal instances as fast as possible? (In practice, you should set the health check parameters to use conservative values as explained in this tutorial. Otherwise, you may risk instances being mistakenly deleted and restarted when there is no real problem.)

  • The instance group has an initial delay configuration setting. Can you determine the minimum delay needed for this demo web server? (In practice, you should set the delay to somewhat longer (10-20%) than it takes for an instance to boot and start serving application requests. Otherwise, you may risk the instance getting stuck in an autohealing boot loop.)
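
The initial delay rule of thumb above is simple arithmetic. As a minimal sketch, the following function adds a 20% margin (the upper end of the suggested range) to a startup time that you have measured yourself and rounds up to a whole second. The function name suggest_initial_delay and the 75-second measurement are hypothetical.

```shell
# Suggest an initial delay that is 20% longer than the measured startup
# time, rounded up to a whole second. suggest_initial_delay is a
# hypothetical helper, not part of gcloud.
suggest_initial_delay() (
    startup_seconds=$1
    echo $(( (startup_seconds * 120 + 99) / 100 ))
)

# Example with a hypothetical measured startup time of 75 seconds.
suggest_initial_delay 75
```

For a 75-second startup this prints 90, which you could then pass to --initial-delay.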

(Optional) View autohealer history

You can view a history of autohealer operations with the gcloud command.

gcloud compute operations list --filter='operationType~compute.instances.repair.*'

For more information, see the documentation on viewing historical autohealing operations.

Cleaning up

After you've finished the autohealing tutorial, you can clean up the resources that you created on GCP so they won't take up quota and you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.

If you created a separate project for this tutorial, delete the entire project. Otherwise, if the project has resources that you want to keep, only delete the specific resources created in this tutorial.

Deleting the project

  1. In the GCP Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the project you want to delete and click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Deleting specific resources

If you cannot delete the project used for this tutorial, delete the tutorial resources individually.

Deleting the instance group

Console

  1. In the GCP Console, go to the Instance groups page.

    Go to the Instance groups page

  2. Click the checkbox for your webserver-group instance group.
  3. Click Delete to delete the instance group.

gcloud

gcloud compute instance-groups managed delete webserver-group --zone europe-west1-b -q

Deleting the instance template

Console

  1. Go to the Instance Templates page in the GCP Console.

    Go to the Instance templates page

  2. Click the checkbox next to the instance template.

  3. Click Delete at the top of the page. In the new window, click Delete to confirm the deletion.

gcloud

gcloud compute instance-templates delete webserver-template -q

Deleting the health check

Console

  1. Go to the Health Checks page in the GCP Console.

    Go to the Health Checks page

  2. Click the checkbox next to the health check.

  3. Click Delete at the top of the page. In the new window, click Delete to confirm the deletion.

gcloud

gcloud compute health-checks delete autohealer-check -q

Deleting the firewall rules

Console

  1. Go to the Firewall Rules page in the GCP Console.

    Go to the Firewall Rules page

  2. Click the checkboxes next to the firewall rules named default-allow-http and default-allow-http-health-check.

  3. Click Delete at the top of the page. In the new window, click Delete to confirm the deletion.

gcloud

gcloud compute firewall-rules delete default-allow-http default-allow-http-health-check -q
