Deploy a cold recoverable application server with regional persistent disks

This document shows you how to implement a cold failover pattern that uses a regional persistent disk that's attached to a VM in a managed instance group to let an application keep running in the event of a zone failure.

To run reliable applications in Google Cloud, design your application infrastructure to handle outages. Depending on your application and business needs, you might need a cold failover, warm failover, or hot failover pattern. For more information on how to determine the best approach for your own applications, see the disaster recovery planning guide.

A managed instance group creates one VM with an attached regional persistent disk behind a load balancer.

In this scenario, the data written to the regional persistent disk is replicated continuously to another zone in the same region. If there's a failure in a zone, the managed instance group fails the VM over to another zone and attaches the replicated regional persistent disk. For more information on the storage replication process for disks, see High availability options using regional persistent disks.

The regional instance group has recreated a VM instance in another zone and reattached the regional persistent disk.

This scenario balances cost against data protection: the regional persistent disk doubles your storage costs, but your data is continuously replicated to a second zone. To reduce your storage costs, consider deploying a cold recoverable application that uses persistent disk snapshots instead.

The following list outlines some high-level differences in data protection options for cold recoverable approaches that use regional persistent disks or persistent disk snapshots. For more information, see High availability options using persistent disks.

  • Data loss - recovery point objective (RPO)
    • Regional persistent disks: Zero for a single failure, such as a sustained outage in a zone or a network disconnect.
    • Persistent disk snapshots: Any data written since the last snapshot was taken, which is typically one hour or more.
  • Recovery time objective (RTO)
    • Regional persistent disks: Deployment time for a new VM, plus several seconds for the regional persistent disk to be reattached.
    • Persistent disk snapshots: Deployment time for a new VM, plus the time to create a new persistent disk from the latest snapshot. The disk creation time depends on the size of the snapshot, and could take tens of minutes or hours.
  • Cost
    • Regional persistent disks: Storage costs double because the regional persistent disk is continuously replicated to another zone.
    • Persistent disk snapshots: You pay only for the amount of snapshot space consumed. For more information, see Disks and images pricing.

Costs

This tutorial uses billable components of Google Cloud, including Compute Engine and Cloud Load Balancing.

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the Compute Engine API.

    Enable the API

  5. Install and initialize the Cloud SDK.

You can run the gcloud command-line tool in the Cloud Console without installing the Cloud SDK. To run the gcloud tool in the Cloud Console, use the Cloud Shell.

Prepare the environment

To get started, define some variables for your resource names and locations. These variables are used by the gcloud commands as you deploy the resources.

Throughout this tutorial, unless otherwise noted, you enter all commands in Cloud Shell or your local development environment.

  1. Replace PROJECT_ID with your own project ID. If desired, provide your own name suffix for resources, such as app.

    Specify a region, such as us-central1, and two zones within that region, such as us-central1-a and us-central1-f. These two zones define where the regional persistent disk and managed instance group are deployed.

    PROJECT_ID=PROJECT_ID
    NAME_SUFFIX=app
    REGION=us-central1
    ZONE1=us-central1-a
    ZONE2=us-central1-f
    
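Optionally, set defaults for the gcloud tool so that you don't have to pass the project and region with every command. This is a minimal sketch and isn't required; the rest of this document assumes the variables above are set in your current shell session:

    gcloud config set project $PROJECT_ID
    gcloud config set compute/region $REGION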

Create a VPC and subnet

To provide network access to the VMs, create a Virtual Private Cloud (VPC) and subnet. Because the managed instance group works across zones within a single region, only one subnet is created. For more information on the advantages of custom mode for managing the IP address ranges in use in your environment, see Use custom mode VPC networks.

  1. Create the VPC with a custom subnet mode:

    gcloud compute networks create network-$NAME_SUFFIX \
        --subnet-mode=custom
    
  2. Now create a subnet in the new VPC. Define your own address range, such as 10.1.0.0/20, that fits in your network range:

    gcloud compute networks subnets create subnet-$NAME_SUFFIX-$REGION \
        --network=network-$NAME_SUFFIX \
        --range=10.1.0.0/20 \
        --region=$REGION
    

Create firewall rules

To let network traffic flow correctly in the VPC, use firewall rules.

  1. First, create firewall rules to allow web traffic and health checks for the load balancer and managed instance group.

    The following HTTP rule allows traffic from any source (the 0.0.0.0/0 range) to any VM that has the http-server tag applied. The health check rule allows traffic from the default Google Cloud health check ranges so that the platform can correctly check the health of resources.

    gcloud compute firewall-rules create allow-http-$NAME_SUFFIX \
        --network=network-$NAME_SUFFIX \
        --direction=INGRESS \
        --priority=1000 \
        --action=ALLOW \
        --rules=tcp:80 \
        --source-ranges=0.0.0.0/0 \
        --target-tags=http-server
    
    gcloud compute firewall-rules create allow-health-check-$NAME_SUFFIX \
        --network=network-$NAME_SUFFIX \
        --action=allow \
        --direction=ingress \
        --source-ranges=130.211.0.0/22,35.191.0.0/16 \
        --target-tags=http-server \
        --rules=tcp:80
    
  2. To allow SSH traffic for the initial configuration of a base VM image, scope the firewall rule to your environment using the --source-ranges parameter. You might need to work with your network team to determine what source ranges your organization uses.

    Replace IP_ADDRESS_SCOPE with your own IP address scopes:

    gcloud compute firewall-rules create allow-ssh-$NAME_SUFFIX \
        --network=network-$NAME_SUFFIX \
        --direction=INGRESS \
        --priority=1000 \
        --action=ALLOW \
        --rules=tcp:22 \
        --source-ranges=IP_ADDRESS_SCOPE
    
  3. After you create the firewall rules, verify that the three rules have been added:

    gcloud compute firewall-rules list \
        --project=$PROJECT_ID \
        --filter="NETWORK=network-$NAME_SUFFIX"
    

    The following example output shows the three rules have been correctly created:

    NAME                    NETWORK      DIRECTION  PRIORITY  ALLOW
    allow-health-check-app  network-app  INGRESS    1000      tcp:80
    allow-http-app          network-app  INGRESS    1000      tcp:80
    allow-ssh-app           network-app  INGRESS    1000      tcp:22
    

Create a regional persistent disk and VM

A regional persistent disk provides continuous replication of data between two zones in a region. A regional managed instance group, created across the same two zones as the regional persistent disk, can then attach the disk to a VM.

  1. Create a 10 GiB SSD persistent disk. Regional persistent disks are billed for the provisioned space, not the consumed space, so understand your storage needs and the associated costs. For more information, see persistent disk pricing. After you create the disk, you can verify its replica zones with the sketch that follows this list.

    gcloud compute disks create disk-$NAME_SUFFIX \
        --region $REGION \
        --replica-zones $ZONE1,$ZONE2 \
        --size=10 \
        --type=pd-ssd
    
  2. Now create a base VM and attach the regional persistent disk created in the previous step. In the next sections you configure this VM with a basic website and create a custom image that's deployed in the managed instance group. The parameters defined at the start of this document are used to name the VM and connect to the correct subnet. Names are also assigned from the parameters for the boot disk and data disk.

    gcloud compute instances create vm-base-$NAME_SUFFIX \
        --zone=$ZONE1 \
        --machine-type=n1-standard-1 \
        --subnet=subnet-$NAME_SUFFIX-$REGION \
        --tags=http-server \
        --image=debian-10-buster-v20210316 \
        --image-project=debian-cloud \
        --boot-disk-size=10GB \
        --boot-disk-type=pd-balanced \
        --boot-disk-device-name=vm-base-$NAME_SUFFIX \
        --disk=mode=rw,name=disk-$NAME_SUFFIX,device-name=disk-$NAME_SUFFIX,scope=regional
    
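To confirm that the new disk is replicated across both zones, you can optionally describe the disk and inspect its replica zones. This verification is a minimal sketch and isn't required for the rest of the deployment:

    gcloud compute disks describe disk-$NAME_SUFFIX \
        --region $REGION \
        --format="value(replicaZones)"

The output lists the two replica zones that you specified, such as us-central1-a and us-central1-f.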

Configure the base VM image

The base VM is used to create a custom image in the next section. This custom image is then used for each VM created in the managed instance group. This scenario uses a basic Apache web server, but the same approach to the infrastructure deployment applies to other single-VM application environments.

On the VM, create a basic index.html file on the persistent disk that's mounted to /var/www/example.com. An Apache configuration file at /etc/apache2/sites-available/example.com.conf is then used to serve web content from the mounted persistent disk location.

The following diagram shows the basic HTML page served by Apache that's stored on the persistent disk. You build this environment in the following steps.

The VM has a basic HTML page stored on the persistent disk with an Apache configuration file to load from the mounted disk location.

  1. To install and configure the simple website, first connect to the base VM using SSH:

    gcloud compute ssh vm-base-$NAME_SUFFIX --zone=$ZONE1
    
  2. In your SSH session to the VM, make a directory for a basic example website and set the appropriate permissions. The persistent disk is mounted to this location in the next steps:

    sudo mkdir -p /var/www/example.com
    sudo chmod a+w /var/www/example.com
    sudo chown -R www-data: /var/www/example.com
    
  3. To format and mount the persistent disk on the VM, first create a variable to match the NAME_SUFFIX value set at the start of this document, such as app:

    NAME_SUFFIX=app
    
  4. Use find to locate the disk's symlink in /dev/disk/by-id, then use readlink to resolve it to the underlying device path, such as /dev/sdb, and assign that path to the DISK_PATH variable:

    DISK_NAME="google-disk-$NAME_SUFFIX"
    DISK_PATH="$(find /dev/disk/by-id -name "${DISK_NAME}" | xargs -I '{}' readlink -f '{}')"
    
  5. Finally, create a file system and mount the device:

    sudo mkfs.ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0,discard $DISK_PATH
    sudo mount -o discard,defaults $DISK_PATH /var/www/example.com
    

    For more information, see how to format and mount a persistent disk.

  6. With the regional persistent disk formatted and mounted, write out a basic HTML page to the web server directory. As this file is written to the persistent disk, the same page is served when the persistent disk is later moved between VMs to simulate a failure.

    sudo tee -a /var/www/example.com/index.html >/dev/null <<'EOF'
    <!doctype html>
    <html lang=en>
    <head>
    <meta charset=utf-8>
        <title>HA / DR example</title>
    </head>
    <body>
        <p>Welcome to a test web server with regional persistent disks!</p>
    </body>
    </html>
    EOF
    
  7. To serve the basic web page, first install Apache:

    sudo apt-get update && sudo apt-get -y install apache2
    
  8. Now write out a basic Apache virtual host configuration file. This configuration defines a website for www.example.com with a DocumentRoot at /var/www/example.com. The persistent disk is mounted to this location in a later step and on startup for VMs in the managed instance group.

    sudo tee -a /etc/apache2/sites-available/example.com.conf >/dev/null <<'EOF'
    <VirtualHost *:80>
        ServerName www.example.com
    
        ServerAdmin webmaster@localhost
        DocumentRoot /var/www/example.com
    
        ErrorLog ${APACHE_LOG_DIR}/error.log
        CustomLog ${APACHE_LOG_DIR}/access.log combined
    </VirtualHost>
    EOF
    
  9. To see the basic website in action, disable the default Apache website, enable the new one, then reload the configuration:

    sudo a2dissite 000-default
    sudo a2ensite example.com.conf
    sudo systemctl reload apache2
    
  10. Exit the SSH session to the VM:

    exit
    
  11. Finally, get the IP address of the VM. Use curl to see the basic web page displayed. You won't connect directly to the VM like this when the scenario is fully deployed, but this step confirms that Apache is configured correctly and the page is loaded from the attached persistent disk. In the next steps, you create an image using this base VM and configure an instance template with a startup script.

    curl $(gcloud compute instances describe vm-base-$NAME_SUFFIX \
        --zone $ZONE1 \
        --format="value(networkInterfaces.accessConfigs.[0].natIP)")
    

    The basic website is returned, as shown in the following example output:

    <!doctype html>
    <html lang=en>
    <head>
    <meta charset=utf-8>
        <title>HA / DR example</title>
    </head>
    <body>
        <p>Welcome to a test web server with regional persistent disks!</p>
    </body>
    </html>
    

Create a VM image and instance template

To create identical VMs that can be deployed automatically without additional configuration, you use a custom VM image. This image captures the OS and Apache configuration, and is used each time a VM is created in the managed instance group in the next steps.

  1. Before you can create an image, first stop the VM:

    gcloud compute instances stop vm-base-$NAME_SUFFIX --zone=$ZONE1
    
  2. Now create an image of the base VM configured in the previous section:

    gcloud compute images create image-$NAME_SUFFIX \
        --source-disk=vm-base-$NAME_SUFFIX \
        --source-disk-zone=$ZONE1 \
        --storage-location=$REGION
    
  3. The managed instance group needs an instance template that defines how each VM should be configured. The image created in the previous step is defined as the source for each VM. The template also defines a startup script that mounts the attached persistent disk on boot; a readable version of this script is shown after this list.

    gcloud compute instance-templates create template-$NAME_SUFFIX \
        --machine-type=n1-standard-1 \
        --subnet=projects/$PROJECT_ID/regions/$REGION/subnetworks/subnet-$NAME_SUFFIX-$REGION \
        --metadata=^,@^startup-script=\#\!\ /bin/bash$'\n'echo\ UUID=\`blkid\ -s\ UUID\ -o\ value\ /dev/sdb\`\ /var/www/example.com\ ext4\ discard,defaults,nofail\ 0\ 2\ \|\ tee\ -a\ /etc/fstab$'\n'mount\ -a \
        --region=$REGION \
        --tags=http-server \
        --image=image-$NAME_SUFFIX
    
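For reference, the escaped --metadata value in the previous command expands to the following startup script. The script adds an /etc/fstab entry for the data disk, which is assumed to appear inside the VM as /dev/sdb, and then mounts it to the web server directory:

    #! /bin/bash
    echo UUID=`blkid -s UUID -o value /dev/sdb` /var/www/example.com ext4 discard,defaults,nofail 0 2 | tee -a /etc/fstab
    mount -a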

Create a regional managed instance group

A regional managed instance group is used to run the VMs. The managed instance group runs across two zones in a region, and monitors the health of the VMs. If there's a zone outage and the VM stops running, the managed instance group can create another VM in a different zone and reattach the regional persistent disk.

  1. First, create a health check to monitor the VMs in the managed instance group. This health check makes sure the VM responds on port 80. For your own applications, monitor the appropriate ports to confirm VM health.

    gcloud compute health-checks create http http-basic-check-$NAME_SUFFIX --port 80
    
  2. Next, create a managed instance group, initially with zero VMs. The instance template created in the previous step is used. The same two zones that are configured for the regional persistent disk are also used to make sure that VMs can attach the persistent disk.

    gcloud compute instance-groups managed create instance-group-$NAME_SUFFIX \
        --base-instance-name=instance-vm-$NAME_SUFFIX \
        --template=template-$NAME_SUFFIX \
        --size=0 \
        --region=$REGION \
        --zones=$ZONE1,$ZONE2 \
        --instance-redistribution-type=NONE \
        --health-check=http-basic-check-$NAME_SUFFIX
    
  3. Now create a single VM in the managed instance group and attach the regional persistent disk as a stateful disk. If there's a failure of this VM, the managed instance group recreates it and reattaches the persistent disk. You can confirm the stateful disk configuration with the sketch that follows this list.

    gcloud compute instance-groups managed create-instance instance-group-$NAME_SUFFIX \
        --instance instance-vm-$NAME_SUFFIX \
        --region $REGION \
        --stateful-disk device-name=disk-$NAME_SUFFIX,source=projects/$PROJECT_ID/regions/$REGION/disks/disk-$NAME_SUFFIX
    
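To confirm that the disk is configured as a stateful disk for this instance, you can optionally describe the managed instance. This is a verification sketch, not a required step; the exact output fields can vary, but the preserved state should reference disk-$NAME_SUFFIX:

    gcloud compute instance-groups managed describe-instance instance-group-$NAME_SUFFIX \
        --instance instance-vm-$NAME_SUFFIX \
        --region $REGION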

For this cold recoverable application scenario, don't configure autoscaling to increase the number of VMs that run in the managed instance group.

Create and configure a load balancer

For users to access your website, you need to allow traffic through to the VM that runs in the managed instance group. You also want to automatically redirect traffic to the new VM if there's a zone failure in the managed instance group.

To direct web traffic through to the managed instance group and handle VM failovers, create a load balancer. The following steps create a backend service for HTTP traffic on port 80, use the health check created in the previous steps, and map an external IP address to the backend service.

For more information, see how to set up a simple external HTTP load balancer.

  1. Create and configure the load balancer for your application:

    gcloud compute instance-groups set-named-ports instance-group-$NAME_SUFFIX \
        --named-ports http:80 \
        --region $REGION
    
    gcloud compute backend-services create web-backend-service-$NAME_SUFFIX \
        --protocol=HTTP \
        --port-name=http \
        --health-checks=http-basic-check-$NAME_SUFFIX \
        --global
    
    gcloud compute backend-services add-backend web-backend-service-$NAME_SUFFIX \
        --instance-group=instance-group-$NAME_SUFFIX \
        --instance-group-region=$REGION \
        --global
    
    gcloud compute url-maps create web-map-http-$NAME_SUFFIX \
        --default-service web-backend-service-$NAME_SUFFIX
    
    gcloud compute target-http-proxies create http-lb-proxy-$NAME_SUFFIX \
        --url-map web-map-http-$NAME_SUFFIX
    
    gcloud compute forwarding-rules create http-content-rule-$NAME_SUFFIX \
        --global \
        --target-http-proxy=http-lb-proxy-$NAME_SUFFIX \
        --ports=80
    
  2. With the load balancer created, get the IP address of the forwarding rule for the web traffic. You can browse to this IP address to see the website in action.

    IP_ADDRESS=$(gcloud compute forwarding-rules describe http-content-rule-$NAME_SUFFIX \
        --global \
        --format="value(IPAddress)")
    
  3. Now use curl, or open your web browser, to access the IP address of the load balancer from the previous step. It takes a few minutes for the load balancer to finish deploying and to correctly direct traffic to your backend. An HTTP 404 error is returned if the load balancer is still deploying. If needed, wait a few minutes and try to access the website again, or use the polling sketch that follows this list.

    curl $IP_ADDRESS
    

    The basic website is returned, as shown in the following example output:

    <!doctype html>
    <html lang=en>
    <head>
    <meta charset=utf-8>
        <title>HA / DR example</title>
    </head>
    <body>
        <p>Welcome to a test web server with regional persistent disks!</p>
    </body>
    </html>
    
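If you prefer to wait on the command line instead of manually retrying, the following optional sketch polls the load balancer until it returns the page. It assumes the IP_ADDRESS variable from the previous step is still set in your shell:

    # Poll the load balancer until it serves the page successfully.
    until curl --silent --fail http://$IP_ADDRESS/ > /dev/null; do
        echo "Waiting for the load balancer to finish deploying..."
        sleep 10
    done
    echo "Load balancer is ready."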

Simulate a zone failure and recovery

Let's review the resource deployments before simulating a zone failure. All of the resources have been created to support the following environment:

A managed instance group behind a load balancer with one VM instance and a connected regional persistent disk.

  • A load balancer sits in front of a regional managed instance group. The managed instance group can distribute VMs across two zones.
  • One VM runs inside the managed instance group, with a regional persistent disk attached to it that stores a basic website.
  • The regional persistent disk is replicated between the same two zones that the managed instance group can deploy VMs to.
  • A health check monitors the status of the VMs inside the managed instance group.

Now simulate a failure and see how the managed instance group uses the cold failover pattern to restore the application environment.

  1. First, check the health status of the managed instance group:

    gcloud compute instance-groups managed list-instances instance-group-$NAME_SUFFIX \
        --region $REGION
    

    The following example output shows the status of the VM as RUNNING and HEALTHY:

    NAME             ZONE           STATUS   HEALTH_STATE  ACTION
    instance-vm-app  us-central1-a  RUNNING  HEALTHY       NONE
    
  2. To simulate a zone failure, connect to the VM using SSH and then shut down the VM from inside the guest OS. Because the shutdown isn't initiated by the managed instance group, the group treats the VM as unhealthy and responds as it would during a real failure.

    Connect to the VM using SSH, specifying the NAME and ZONE from the output of the previous command.

    gcloud compute ssh NAME --zone ZONE
    
  3. From inside the SSH session, halt the VM:

    sudo halt
    

    Your SSH connection to the VM automatically closes as part of the shutdown process. Complete the rest of the steps in your regular terminal session.

  4. Check the health status of the managed instance group and verify the VM reports that it has timed out:

    gcloud compute instance-groups managed list-instances instance-group-$NAME_SUFFIX \
        --region $REGION
    

    The following example output shows the VM's HEALTH_STATE as TIMEOUT:

    NAME             ZONE           STATUS   HEALTH_STATE  ACTION
    instance-vm-app  us-central1-a  RUNNING  TIMEOUT       NONE
    
  5. With the VM in an unhealthy state, try to access the website again using curl or your web browser and the IP address of the load balancer:

    curl $IP_ADDRESS
    

    Because the VM is unreachable, the load balancer returns an HTTP 502 server error, as shown in the following example output. This behavior continues until the managed instance group recovers from the outage.

    <html><head>
    <meta http-equiv="content-type" content="text/html;charset=utf-8">
    <title>502 Server Error</title>
    </head>
    <body text=#000000 bgcolor=#ffffff>
    <h1>Error: Server Error</h1>
    <h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>
    <h2></h2>
    </body></html>
    
  6. Keep rerunning the list-instances command every few seconds to see how the managed instance group responds to the failure. The STATUS, HEALTH_STATE, and ACTION values move through stages such as the following:

    • STOPPING and TIMEOUT
    • STAGING and CREATING
    • RUNNING and UNKNOWN
    • and finally, RUNNING and HEALTHY

    It takes a minute or two to complete this cycle and for the health check to report a HEALTHY instance status. The following example output shows some of the responses to the health check through these stages:

    $ gcloud compute instance-groups managed list-instances \
        instance-group-app --region us-central1
    
    NAME             ZONE           STATUS    HEALTH_STATE  ACTION
    instance-vm-app  us-central1-a  STOPPING  TIMEOUT       RECREATING
    
    $ gcloud compute instance-groups managed list-instances \
        instance-group-app --region us-central1
    
    NAME             ZONE           STATUS   HEALTH_STATE  ACTION
    instance-vm-app  us-central1-a  RUNNING  UNKNOWN       VERIFYING
    
    $ gcloud compute instance-groups managed list-instances \
        instance-group-app --region us-central1
    
    NAME            ZONE           STATUS   HEALTH_STATE  ACTION
    instance-vm-app us-central1-a  RUNNING  HEALTHY       NONE
    

    In an actual zone outage, the recreated VM would be in a different zone. In this simulated failure, the managed instance group might recreate the VM in the same zone since there's no underlying infrastructure problem. The following diagram shows the managed instance group has created another VM, in a different zone, but with the same regional persistent disk attached:

    The regional instance group has recreated a VM instance in another zone and reattached the regional persistent disk.

  7. When the new VM reports as HEALTHY, the web page is correctly displayed again. Use curl or your web browser to access the IP address of the load balancer one more time:

    curl $IP_ADDRESS
    

    This response confirms that the instance configuration and the regional persistent disk attachment were successful. The following example response shows the web page correctly running on the recreated VM:

    <!doctype html>
    <html lang=en>
    <head>
    <meta charset=utf-8>
        <title>HA / DR example</title>
    </head>
    <body>
        <p>Welcome to a test web server with regional persistent disks!</p>
    </body>
    </html>
    

The VM image, instance template, and regional persistent disk maintain all of the configuration for the application VM. The managed instance group handles the failover if the health check reports an unhealthy VM. You don't need to take any manual steps for this failover and cold recovery to happen.

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

To delete the individual resources created in this document, complete the following steps.

  1. Delete the load balancer configuration:

    gcloud compute forwarding-rules delete \
        http-content-rule-$NAME_SUFFIX --global --quiet
    
    gcloud compute target-http-proxies delete \
        http-lb-proxy-$NAME_SUFFIX --quiet
    
    gcloud compute url-maps delete web-map-http-$NAME_SUFFIX --quiet
    
    gcloud compute backend-services delete \
        web-backend-service-$NAME_SUFFIX --global --quiet
    
  2. Delete the managed instance group and health check:

    gcloud compute instance-groups managed delete instance-group-$NAME_SUFFIX \
        --region=$REGION --quiet
    
    gcloud compute health-checks delete http-basic-check-$NAME_SUFFIX --quiet
    
  3. Delete the instance template, image, base VM, and persistent disk:

    gcloud compute instance-templates delete template-$NAME_SUFFIX --quiet
    
    gcloud compute images delete image-$NAME_SUFFIX --quiet
    
    gcloud compute instances delete vm-base-$NAME_SUFFIX --zone=$ZONE1 --quiet
    
    gcloud compute disks delete disk-$NAME_SUFFIX --region=$REGION --quiet
    
  4. Delete the firewall rules:

    gcloud compute firewall-rules delete allow-health-check-$NAME_SUFFIX --quiet
    
    gcloud compute firewall-rules delete allow-ssh-$NAME_SUFFIX --quiet
    
    gcloud compute firewall-rules delete allow-http-$NAME_SUFFIX --quiet
    
  5. Delete the subnet and VPC:

    gcloud compute networks subnets delete \
        subnet-$NAME_SUFFIX-$REGION --region=$REGION --quiet
    
    gcloud compute networks delete network-$NAME_SUFFIX --quiet
    

What's next

This document showed you how to provide disaster recovery for a cold recoverable application using a regional persistent disk. Depending on your business needs and budget, you can deploy a cold recoverable application using persistent disk snapshots instead.

For more information on how to determine the best approach for your own applications and which recovery method to use, see the disaster recovery planning guide.