Regional persistent disk failover

Regional persistent disks provide synchronous replication of data between two zones in a region. They are a useful building block for implementing high availability (HA) services in Compute Engine, and they are designed to work with regional managed instance groups.

Failure scenarios

With regional persistent disks, data is automatically replicated to two zones in a region. A regional persistent disk handles brief disruptions in disk operations transparently. It automatically detects errors and slowness, and it later catches up any data that was replicated to only one zone.

If both replicas are available, a write is acknowledged back to a VM when it is durably persisted in both replicas. If one of the replicas is unavailable, a write is acknowledged after it is durably persisted in the healthy replica. If and when the unhealthy replica is back up (as detected by Compute Engine), then it is brought in sync with the healthy replica. This operation is transparent to a VM.
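The acknowledgement rule above can be summarized as a small decision function. This is an illustrative model only, not how Compute Engine implements replication; the function name `ack_requirement` and the `healthy` and `unavailable` labels are hypothetical.

```shell
# Illustrative model of the write-acknowledgement rule described above.
# ack_requirement and its state labels are hypothetical names for this sketch.
ack_requirement() {
  replica_a="$1"   # healthy | unavailable
  replica_b="$2"   # healthy | unavailable
  case "$replica_a,$replica_b" in
    healthy,healthy)
      echo "acknowledge after the write is durable in both replicas" ;;
    healthy,unavailable|unavailable,healthy)
      echo "acknowledge after the write is durable in the healthy replica" ;;
    *)
      echo "no acknowledgement: the disk is unavailable" ;;
  esac
}

ack_requirement healthy unavailable
```

The middle case is the one that creates data that exists in only one zone, which is why the catch-up step matters when the other replica returns.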

To prevent unintended data loss in the event that both replicas become unavailable at the same time, we recommend that you back up your regional persistent disks regularly using snapshots.
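As a rough sketch of a regular backup, the helper below prints a `gcloud compute disks snapshot` command for a regional disk so that you can review it before running it. The helper name `snapshot_command`, the example disk and region names, and the timestamp-based snapshot name are assumptions made for illustration.

```shell
# Print (rather than run) a snapshot command for a regional persistent disk.
# snapshot_command is a hypothetical helper; the disk name, region, and
# timestamp suffix are example values chosen for this sketch.
snapshot_command() {
  disk="$1"
  region="$2"
  stamp="${3:-$(date -u +%Y%m%d-%H%M%S)}"   # optional fixed suffix, else UTC time
  echo "gcloud compute disks snapshot $disk --region=$region --snapshot-names=$disk-$stamp"
}

snapshot_command my-regional-disk us-central1 20240101-000000
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; in a scheduled job you would execute the printed command instead.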

Zonal failures

A regional persistent disk is replicated in two zones:

  • One replica is located in the same zone as the VM instance to which it is attached (the primary zone).
  • The other replica is located in an alternate zone in the same region (the secondary zone).

In the event that the primary zone fails, you can fail over your regional persistent disk to a VM instance in another zone by using the --force-attach flag with the attach-disk command.

In this scenario, you might not be able to detach a disk from the instance because the instance can't be reached to perform the detach operation. Force-attach lets you attach a regional persistent disk to a VM instance even if that disk is currently attached to another instance.

After you complete the force-attach operation, Compute Engine prevents the original VM from writing to the disk. Using force-attach lets you safely regain access to your data and recover your service. You also have the option to manually shut down the instance after you perform the force-attach step.
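After a force-attach, you might want to confirm which instances the disk is attached to and, optionally, stop the original VM. The sketch below only prints the relevant commands; `post_failover_commands` and the example resource names are hypothetical, and the stop command may not succeed while the original zone is unreachable.

```shell
# Print commands to confirm which instances use the disk and, optionally,
# to stop the original VM. post_failover_commands is a hypothetical helper.
post_failover_commands() {
  disk="$1"; region="$2"; original_vm="$3"; original_zone="$4"
  # The "users" field of a disk lists the instances it is attached to.
  echo "gcloud compute disks describe $disk --region=$region --format='value(users)'"
  # Stopping the original VM is optional (see the paragraph above).
  echo "gcloud compute instances stop $original_vm --zone=$original_zone"
}

post_failover_commands my-regional-disk us-central1 primary-vm us-central1-a
```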

In the event that the secondary zone fails, the unhealthy replica comes back into sync with the healthy replica automatically when the secondary zone recovers.

Initial state: Two healthy zones
Failure: The primary zone fails
New state:
  • The healthy replica has all disk data.
  • The newly unhealthy replica is not guaranteed to have all disk data.
Action: Force attach the disk to a VM in the healthy zone.

Initial state: Two healthy zones
Failure: The secondary zone fails
New state:
  • The unhealthy replica is not guaranteed to have all disk data until the zone recovers.
Action: No action needed. The unhealthy replica is brought back into sync when the zone recovers.

Initial state: One healthy zone and one unhealthy zone
Failure: The healthy zone fails
New state:
  • Both replicas are in unhealthy zones and cannot serve traffic. The disk is unavailable.
  • If the zonal outage is temporary, no data is lost.
  • If the zonal outage is permanent, data that was written only to the healthy replica is permanently lost.
Action:
  • We do not recommend force attaching because the disk cannot serve traffic.
  • You cannot create a snapshot of the disk until the zone recovers. As a best practice, back up the regional persistent disks regularly using snapshots.
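The scenarios above can be condensed into a lookup that a runbook or script might use. This is a minimal sketch; the function name `recommended_action` and the argument labels are hypothetical, and the wording of the third action is a paraphrase of the guidance above.

```shell
# Map the failure scenarios above to the recommended action.
# recommended_action and its argument labels are hypothetical.
recommended_action() {
  initial="$1"   # both-healthy | one-unhealthy
  failed="$2"    # primary | secondary | healthy
  case "$initial:$failed" in
    both-healthy:primary)
      echo "force-attach the disk to a VM in the healthy zone" ;;
    both-healthy:secondary)
      echo "no action needed; the replica resyncs when the zone recovers" ;;
    one-unhealthy:healthy)
      echo "do not force-attach; the disk is unavailable until a zone recovers" ;;
    *)
      echo "unknown scenario" >&2
      return 1 ;;
  esac
}

recommended_action both-healthy primary
```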

Application and VM failures

In the event of outages caused by VM misconfiguration, an unsuccessful OS upgrade, or other application failures, you can force-attach your regional persistent disk to a VM instance in the same zone.

Failure category (probability): Application failure (High)
Failure types:
  • Application unresponsive
  • Application admin actions (for example, upgrade)
  • Human error (for example, misconfiguration of parameters such as SSL certificates or ACLs)
Action: The application control plane can trigger failover based on health check thresholds.

Failure category (probability): VM failure (Medium)
Failure types:
  • Infrastructure/hardware failure
  • VM unresponsive due to CPU contention or intermittent network interruption
Action: VMs are usually autohealed. The application control plane can trigger failover based on health check thresholds.

Failure category (probability): Application corruption (Low-Medium)
Failure types:
  • Application data corruption (for example, due to application bugs or an unsuccessful OS upgrade)
Action: Application recovery:
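For the application and VM failure categories, the health check threshold that triggers failover can be sketched as a count of consecutive failed checks. This is one possible policy, not a Compute Engine feature; `should_failover`, the `ok` and `fail` labels, and the threshold value are all hypothetical.

```shell
# Decide whether to trigger failover after N consecutive failed health checks.
# should_failover, the ok/fail labels, and the threshold are hypothetical.
should_failover() {
  threshold="$1"; shift
  consecutive=0
  for result in "$@"; do          # each result is "ok" or "fail", oldest first
    if [ "$result" = "fail" ]; then
      consecutive=$((consecutive + 1))
    else
      consecutive=0               # any success resets the streak
    fi
    if [ "$consecutive" -ge "$threshold" ]; then
      return 0
    fi
  done
  return 1
}

if should_failover 3 ok fail fail fail; then
  echo "trigger failover"
fi
```

Requiring consecutive failures, rather than a single one, avoids failing over on a brief network interruption that the VM would recover from on its own.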

Fail over your regional persistent disk using force-attach

Console

Create a standby VM instance and force-attach a disk to the instance.

  1. In the Google Cloud Console, go to the VM instances page.

  2. Select your project.

  3. Click Create.

  4. Specify a Name for your instance.

  5. Select the region where your regional persistent disk resides.

  6. Select the zone for the standby VM instance.

  7. Click Management, disks, networking, SSH keys.

  8. Click Disks.

  9. In the Additional Disks section, click Attach existing disk.

  10. Select the regional persistent disk from the list.

  11. Select the checkbox to force attach the disk.

  12. Click Done.

  13. Click Create to finish creating this instance. The new VM instance appears on the VM instances page.

You can perform the same steps to force-attach a disk to the original instance after the failure is resolved.

gcloud

Using the gcloud command-line tool, run the gcloud compute instances attach-disk command to attach the regional persistent disk to a VM instance. Include the --disk-scope flag and set it to regional.

gcloud compute instances attach-disk INSTANCE_NAME \
    --disk DISK_NAME --disk-scope regional \
    --force-attach

Replace the following:

  • INSTANCE_NAME: the name of the new VM instance in the region
  • DISK_NAME: the name of the disk

After you force-attach the disk, mount the file systems on the disk, if necessary. The instance can use the force-attached disk to continue read and write operations.
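The following sketch strings the force-attach and mount steps together, assuming you want to pass the instance zone and a device name explicitly. It only prints the commands for review. The helper name `failover_commands`, the device name `data-disk`, and the mount point `/mnt/disks/data` are hypothetical; the last two commands are meant to be run on the standby VM itself.

```shell
# Print the gcloud force-attach command with an explicit zone and device name,
# followed by the commands to mount the disk on the standby VM.
# failover_commands, the device name, and the mount point are hypothetical.
failover_commands() {
  instance="$1"; zone="$2"; disk="$3"
  device="${4:-data-disk}"          # exposed as /dev/disk/by-id/google-<device>
  mount_point="${5:-/mnt/disks/data}"
  echo "gcloud compute instances attach-disk $instance --zone=$zone --disk=$disk --disk-scope=regional --device-name=$device --force-attach"
  # Run the remaining commands on the standby VM itself.
  echo "sudo mkdir -p $mount_point"
  echo "sudo mount -o discard,defaults /dev/disk/by-id/google-$device $mount_point"
}

failover_commands standby-vm us-central1-b my-regional-disk
```

Setting a device name at attach time gives the disk a predictable path under /dev/disk/by-id, so the mount command does not depend on the order in which disks were attached.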

API

Construct a POST request to the compute.instances.attachDisk method, and include the URL of the regional persistent disk in the request body. To attach the disk to the new VM instance, the forceAttach=true query parameter is required because the disk is still attached to the primary instance.

POST https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instances/INSTANCE_NAME/attachDisk?forceAttach=true

{
 "source": "projects/PROJECT_ID/regions/REGION/disks/DISK_NAME"
}

Replace the following:

  • PROJECT_ID: your project ID
  • ZONE: the location of your instance
  • INSTANCE_NAME: the name of the instance to which you are force-attaching the disk
  • REGION: the region where your regional persistent disk is located
  • DISK_NAME: the name of the disk

After you force-attach the disk, mount the file systems on the disk, if necessary. The instance can use the force-attached disk to continue read and write operations.
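As one way to send this request from a terminal, the sketch below builds the URL and JSON body from the pattern above and shows a commented-out curl invocation. The helper names `attach_disk_url` and `attach_disk_body` and the example resource names are hypothetical; the access token is assumed to come from `gcloud auth print-access-token`.

```shell
# Build the attachDisk request URL and JSON body from the pattern above.
# attach_disk_url and attach_disk_body are hypothetical helpers.
attach_disk_url() {
  project="$1"; zone="$2"; instance="$3"
  echo "https://compute.googleapis.com/compute/v1/projects/$project/zones/$zone/instances/$instance/attachDisk?forceAttach=true"
}

attach_disk_body() {
  project="$1"; region="$2"; disk="$3"
  printf '{"source": "projects/%s/regions/%s/disks/%s"}\n' "$project" "$region" "$disk"
}

# Example invocation (not executed here); the token comes from the gcloud CLI.
# curl -X POST \
#   -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#   -H "Content-Type: application/json" \
#   -d "$(attach_disk_body my-project us-central1 my-regional-disk)" \
#   "$(attach_disk_url my-project us-central1-b standby-vm)"
```

Note that the URL is scoped to the zone of the instance, while the disk source is scoped to the region.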

What's next