Manage failures for regional Persistent Disk


Regional Persistent Disk is a storage option that provides synchronous replication of data between two zones in a region. You can use regional Persistent Disk as a building block when you implement high availability (HA) services in Compute Engine.

This document describes the scenarios that can disrupt your regional Persistent Disk volumes and explains how to manage each scenario.

Before you begin

  • Review the basics about regional Persistent Disk zonal replication and failover. For more information, see About regional Persistent Disk.
  • If you haven't already, set up authentication. Authentication is the process by which your identity is verified for access to Google Cloud services and APIs. To run code or samples from a local development environment, you can authenticate to Compute Engine as follows.

    Select the tab for how you plan to use the samples on this page:

    gcloud

    1. Install the Google Cloud CLI, then initialize it by running the following command:

      gcloud init
    2. Set a default region and zone.
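
      For example, assuming you want to use the us-central1 region and the us-central1-a zone (placeholders only; substitute your own), run:

      gcloud config set compute/region us-central1
      gcloud config set compute/zone us-central1-a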

    REST

    To use the REST API samples on this page in a local development environment, you use the credentials you provide to the gcloud CLI.

      Install the Google Cloud CLI, then initialize it by running the following command:

      gcloud init

Failure scenarios

With regional Persistent Disk, data is synchronously replicated to two zones in a region. When the disk is fully replicated, a write is acknowledged back to a virtual machine (VM) instance only after it is durably persisted in both replicas.

If replication to one zone fails or is very slow for an extended period, the disk replication status switches to degraded. In this mode, a write is acknowledged after it is durably persisted in the remaining healthy replica.

When Compute Engine detects that replication can resume, the data written since the device entered the degraded state is synced to both zones and the disk returns to the fully replicated state. This transition is fully automated.

RPO and RTO are undefined while a device is in a degraded state. To minimize data loss and downtime if a disk fails while operating in a degraded state, we recommend that you back up your regional Persistent Disk volumes regularly by using standard snapshots. You can recover a disk by restoring a snapshot.
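
For example, the following is a minimal sketch of creating a standard snapshot of a regional Persistent Disk volume with the gcloud CLI. The snapshot name, disk name, and region are placeholders:

gcloud compute snapshots create my-regional-disk-backup \
    --source-disk=my-regional-disk \
    --source-disk-region=us-central1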

Zonal failures

A regional Persistent Disk volume is synchronously replicated to disk replicas in the primary and secondary zones. Zonal failures happen when a zonal replica goes down and becomes unavailable. Zonal failures can happen in either of the zones due to one of the following reasons:

  • There is a zonal outage.
  • The replica experiences excessive slowness in write operations.

The following scenarios describe the zonal failures that you might encounter with regional Persistent Disk and the recommended action for each. Each scenario assumes that the primary zonal replica is healthy and in sync in the initial state.

Scenario 1

Initial state of the disk:
  • Primary replica: Synced
  • Secondary replica: Synced
  • Disk status: Fully replicated
  • Disk attached in: primary zone

Failure in: Primary zone

New state of the disk:
  • Primary replica: Out of sync or unavailable
  • Secondary replica: Synced
  • Disk status: Degraded
  • Disk attached in: primary zone

Consequences of failure:
  • The replica in the secondary zone remains healthy and has the latest disk data.
  • The replica in the primary zone is unhealthy and is not guaranteed to have all the disk data.

Action to take: Fail over the disk by force-attaching it to a VM in the healthy secondary zone.

Scenario 2

Initial state of the disk:
  • Primary replica: Synced
  • Secondary replica: Synced
  • Disk status: Fully replicated
  • Disk attached in: primary zone

Failure in: Secondary zone

New state of the disk:
  • Primary replica: Synced
  • Secondary replica: Out of sync or unavailable
  • Disk status: Degraded
  • Disk attached in: primary zone

Consequences of failure:
  • The replica in the primary zone remains healthy and has the latest disk data.
  • The replica in the secondary zone is unhealthy and is not guaranteed to have all the disk data.

Action to take: No action needed. Compute Engine brings the unhealthy replica in the secondary zone back into sync after it becomes available again.

Scenario 3

Initial state of the disk:
  • Primary replica: Synced
  • Secondary replica: Out of sync and unavailable
  • Disk status: Degraded
  • Disk attached in: primary zone

Failure in: Primary zone

New state of the disk:
  • Primary replica: Synced but unavailable
  • Secondary replica: Out of sync
  • Disk status: Unavailable
  • Disk attached in: primary zone

Consequences of failure:
  • Both zonal replicas are unavailable and cannot serve traffic. The disk becomes unavailable.
  • If the zonal outage or replica failure is temporary, then no data is lost.
  • If the zonal outage or replica failure is permanent, then any data written to the healthy replica while the disk was degraded is permanently lost.

Action to take: Google recommends that you use an existing standard snapshot to create a new disk and recover your data. As a best practice, back up your regional Persistent Disk volumes regularly by using standard snapshots.

Scenario 4

Initial state of the disk:
  • Primary replica: Synced
  • Secondary replica: Catching up but available
  • Disk status: Catching up
  • Disk attached in: primary zone

Failure in: Primary zone

New state of the disk:
  • Primary replica: Unavailable
  • Secondary replica: Catching up but available
  • Disk status: Unavailable
  • Disk attached in: primary zone

Consequences of failure:
  • Both zonal replicas cannot serve traffic. The disk becomes unavailable.
  • If the zonal outage or replica failure is temporary, then your disk resumes operations after the primary replica is available again.
  • If the zonal outage or replica failure is permanent, your disk becomes unusable.

Action to take: If the failure is permanent, you might be able to recover the disk data from the available replica by creating a standard snapshot from the replica recovery checkpoint and then creating a new disk from that snapshot. For more information, see Use replica recovery checkpoint to recover degraded regional Persistent Disk volumes.

Scenario 5

Initial state of the disk:
  • Primary replica: Synced
  • Secondary replica: Out of sync but available
  • Disk status: Degraded
  • Disk attached in: primary zone

Failure in: Primary zone

New state of the disk:
  • Primary replica: Unavailable
  • Secondary replica: Out of sync but available
  • Disk status: Unavailable
  • Disk attached in: primary zone

Consequences of failure:
  • Both zonal replicas cannot serve traffic. The disk becomes unavailable.
  • If the zonal outage or replica failure is temporary, then your disk resumes operations after the primary replica is available again.
  • If the zonal outage or replica failure is permanent, your disk becomes unusable.

Action to take: If the failure is permanent, you might be able to recover the disk data from the available replica by creating a standard snapshot from the replica recovery checkpoint and then creating a new disk from that snapshot. For more information, see Use replica recovery checkpoint to recover degraded regional Persistent Disk volumes.

Application and VM failures

In the event of outages caused by VM misconfiguration, an unsuccessful OS upgrade, or other application failures, you can force-attach your regional Persistent Disk volume to another VM instance in the same zone.

Application failure (probability: High)

Failure types:
  • Unresponsive applications
  • Failure due to application administrative actions (for example, an upgrade)
  • Human error (for example, misconfiguration of parameters such as SSL certificates or ACLs)

Action: The application control plane can trigger failover based on health check thresholds.

VM failure (probability: Medium)

Failure types:
  • Infrastructure or hardware failure
  • VM unresponsive due to CPU contention or intermittent network interruptions

Action: VMs are usually autohealed. The application control plane can trigger failover based on health check thresholds.

Application corruption (probability: Low-Medium)

Failure types:
  • Application data corruption (for example, due to application bugs or an unsuccessful OS upgrade)

Action: Application recovery. Recover the data by restoring an existing standard snapshot of the disk.

Fail over your regional Persistent Disk volume using force-attach

If the primary zone fails, you can fail over your regional Persistent Disk volume to a VM in another zone by using a force-attach operation. During a failure in the primary zone, you might not be able to detach the disk from the original VM, because that VM can't be reached to perform the detach operation. The force-attach operation lets you attach a regional Persistent Disk volume to a VM even if that volume is still attached to another VM. After you complete the force-attach operation, Compute Engine prevents the original VM from writing to the regional Persistent Disk volume, so you can safely regain access to your data and recover your service. You also have the option to manually shut down the original VM instance after you perform the force-attach operation.

To force attach an existing disk to a VM, perform the following steps:

Console

  1. Go to the VM instances page.

    Go to VM instances

  2. Select your project.

  3. Click the name of the VM you want to change.

  4. On the details page, click Edit.

  5. In the Additional disks section, click Attach additional disk.

  6. Select the regional Persistent Disk volume from the drop-down list.

  7. To force attach the disk, select the Force-attach disk checkbox.

  8. Click Done, and then click Save.

You can perform the same steps to force-attach a disk to the original VM after the failure is resolved.

gcloud

In the gcloud CLI, use the gcloud compute instances attach-disk command to attach the regional Persistent Disk volume to a VM instance. Include the --disk-scope flag and set it to regional.

gcloud compute instances attach-disk VM_NAME \
    --disk DISK_NAME --disk-scope regional \
    --force-attach

Replace the following:

  • VM_NAME: the name of the new VM instance in the region
  • DISK_NAME: the name of the disk

After you force-attach the disk, mount the file systems on the disk, if necessary. The VM instance can use the force-attached disk to continue read and write operations.
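
For example, the following is a minimal sketch of mounting an existing file system on the force-attached disk from within the VM. It assumes that the disk was attached with a device name of DISK_NAME (for example, by using the --device-name flag); otherwise, check /dev/disk/by-id/ for the symlink that Compute Engine created. The mount point is a placeholder:

# Run these commands on the VM after you connect to it, for example over SSH.
sudo mkdir -p /mnt/disks/regional-disk
sudo mount -o discard,defaults /dev/disk/by-id/google-DISK_NAME /mnt/disks/regional-disk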

REST

Construct a POST request to the compute.instances.attachDisk method, and include the URL of the regional Persistent Disk volume. To attach the disk to the new VM instance, the forceAttach=true query parameter is required, even though the disk is still attached to the primary VM instance.

POST https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instances/VM_NAME/attachDisk?forceAttach=true

{
 "source": "projects/PROJECT_ID/regions/REGION/disks/DISK_NAME"
}

Replace the following:

  • PROJECT_ID: your project ID
  • ZONE: the location of your VM instance
  • VM_NAME: the name of the VM instance where you are adding the new Persistent Disk volume
  • REGION: the region where your new regional Persistent Disk volume is located
  • DISK_NAME: the name of the new disk
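
For reference, one way to send this request from a local development environment is with curl, using your gcloud CLI credentials. All values in the URL and body are placeholders:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{"source": "projects/PROJECT_ID/regions/REGION/disks/DISK_NAME"}' \
    "https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instances/VM_NAME/attachDisk?forceAttach=true"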

After you attach the disk, mount the file systems on the disk, if necessary. The VM instance can use the force-attached disk to continue read and write operations.

Use replica recovery checkpoint to recover degraded regional Persistent Disk volumes

A replica recovery checkpoint represents the most recent crash-consistent point in time of a fully replicated regional Persistent Disk volume. Compute Engine lets you create standard snapshots from the replica recovery checkpoint for degraded disks.

In rare scenarios, when your disk is degraded, the zonal replica that is synced with the latest disk data can also fail before the out-of-sync replica catches up. In that case, you can't force-attach your disk to VMs in either zone; the regional Persistent Disk volume becomes unavailable and you must migrate the data to a new disk. If you don't have any existing standard snapshots for your disk, you might still be able to recover the disk data from the incomplete replica by using a standard snapshot created from the replica recovery checkpoint. See Procedure to migrate and recover disk data for detailed steps.

Required roles

To get the permissions that you need to migrate regional Persistent Disk data using a replica recovery checkpoint, ask your administrator to grant you the following IAM roles:

  • To migrate regional Persistent Disk data using a replica recovery checkpoint: Compute Instance Admin (v1) (roles/compute.instanceAdmin.v1) on the project

For more information about granting roles, see Manage access.

This predefined role contains the permissions required to migrate regional Persistent Disk data using a replica recovery checkpoint. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to migrate regional Persistent Disk data using a replica recovery checkpoint:

  • To create a standard snapshot from the replica recovery checkpoint:
    • compute.snapshots.create on the project
    • compute.disks.createSnapshot on the disk
  • To create a new regional Persistent Disk from the standard snapshot: compute.disks.create on the project where you want to create the new disk
  • To migrate VMs to the new disk:
    • compute.instances.attachDisk on the VM instance
    • compute.disks.use permission on the newly created disk

You might also be able to get these permissions with custom roles or other predefined roles.

Procedure to migrate and recover disk data

To recover and migrate the data of a regional Persistent Disk volume by using the replica recovery checkpoint, perform the following steps:

  1. Create a standard snapshot of the impacted regional Persistent Disk volume from its replica recovery checkpoint. You can create the standard snapshot for a disk from its replica recovery checkpoint only by using the Google Cloud CLI or REST.

    gcloud

    To create a snapshot using the replica recovery checkpoint, use the gcloud compute snapshots create command. Include the --source-disk-for-recovery-checkpoint flag to specify that you want to create the snapshot by using a replica recovery checkpoint. Exclude the --source-disk and --source-disk-region flags.

    gcloud compute snapshots create SNAPSHOT_NAME \
        --source-disk-for-recovery-checkpoint=SOURCE_DISK \
        --source-disk-for-recovery-checkpoint-region=SOURCE_REGION \
        --storage-location=STORAGE_LOCATION \
        --snapshot-type=SNAPSHOT_TYPE
    

    Replace the following:

    • DESTINATION_PROJECT_ID: The ID of the project in which you want to create the snapshot. To create the snapshot in a project other than your current default project, include the --project flag.
    • SNAPSHOT_NAME: A name for the snapshot.
    • SOURCE_DISK: The name or full path of the source disk that you want to use to create the snapshot. To specify the full path of a source disk, use the following syntax:
        projects/SOURCE_PROJECT_ID/regions/SOURCE_REGION/disks/SOURCE_DISK_NAME
        

      If you specify the full path to the source disk, you can exclude the --source-disk-for-recovery-checkpoint-region flag. If you specify only the disk's name, then you must include this flag.

      To create a snapshot from the recovery checkpoint of a source disk in a different project, you must specify the full path to the source disk.

    • SOURCE_PROJECT_ID: The project ID of the source disk whose checkpoint you want to use to create the snapshot.
    • SOURCE_REGION: The region of the source disk whose checkpoint you want to use to create the snapshot.
    • SOURCE_DISK_NAME: The name of the source disk whose checkpoint you want to use to create the snapshot.
    • STORAGE_LOCATION: Optional: The Cloud Storage multi-region or the Cloud Storage region where you want to store your snapshot. You can specify only one storage location.
      Use the --storage-location flag only if you want to override the predefined or customized default storage location configured in your snapshot settings.
    • SNAPSHOT_TYPE: The snapshot type, either STANDARD or ARCHIVE. If a snapshot type is not specified, a STANDARD snapshot is created.
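
    For example, a hypothetical invocation for a degraded disk named my-regional-disk in the us-central1 region (both names are placeholders) might look like the following:

    gcloud compute snapshots create my-recovery-snapshot \
        --source-disk-for-recovery-checkpoint=my-regional-disk \
        --source-disk-for-recovery-checkpoint-region=us-central1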

    You can use the replica recovery checkpoint to create a snapshot only for degraded disks. If you try to create a snapshot from a replica recovery checkpoint when the device is fully replicated, you see the following error message:

    The device is fully replicated and should not create snapshots out of a recovery checkpoint. Please
    create regular snapshots instead.
    

    REST

    To create a snapshot using the replica recovery checkpoint, make a POST request to the snapshots.insert method. Exclude the sourceDisk parameter and instead include the sourceDiskForRecoveryCheckpoint parameter to specify that you want to create the snapshot using the checkpoint.

    POST https://compute.googleapis.com/compute/v1/projects/DESTINATION_PROJECT_ID/global/snapshots
    
    {
      "name": "SNAPSHOT_NAME",
      "sourceDiskForRecoveryCheckpoint": "projects/SOURCE_PROJECT_ID/regions/SOURCE_REGION/disks/SOURCE_DISK_NAME",
      "storageLocations": "STORAGE_LOCATION",
      "snapshotType": "SNAPSHOT_TYPE"
    }
    

    Replace the following:

    • DESTINATION_PROJECT_ID: The ID of the project in which you want to create the snapshot.
    • SNAPSHOT_NAME: A name for the snapshot.
    • SOURCE_PROJECT_ID: The project ID of the source disk whose checkpoint you want to use to create the snapshot.
    • SOURCE_REGION: The region of the source disk whose checkpoint you want to use to create the snapshot.
    • SOURCE_DISK_NAME: The name of the source disk whose checkpoint you want to use to create the snapshot.
    • STORAGE_LOCATION: Optional: The Cloud Storage multi-region or the Cloud Storage region where you want to store your snapshot. You can specify only one storage location.
      Use the storageLocations parameter only if you want to override the predefined or customized default storage location configured in your snapshot settings.
    • SNAPSHOT_TYPE: The snapshot type, either STANDARD or ARCHIVE. If a snapshot type is not specified, a STANDARD snapshot is created.

    You can use the replica recovery checkpoint to create a snapshot only for degraded disks. If you try to create a snapshot from a replica recovery checkpoint when the device is fully replicated, you see the following error message:

    The device is fully replicated and should not create snapshots out of a recovery checkpoint. Please
    create regular snapshots instead.
    

  2. Create a new regional Persistent Disk volume from this snapshot. The new disk contains all the disk data captured in the most recent replica recovery checkpoint. For detailed steps, see Create a new VM with regional Persistent Disk boot disks.
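
    For example, the following sketch creates a regional disk from the snapshot by using the gcloud CLI. The disk name, snapshot name, size, region, and replica zones are placeholders; the two replica zones must be in the disk's region:

    gcloud compute disks create my-recovered-disk \
        --region=us-central1 \
        --replica-zones=us-central1-a,us-central1-b \
        --source-snapshot=my-recovery-snapshot \
        --size=200GB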

  3. Migrate all the VM workloads to the newly created disk and validate that these VM workloads are running correctly. For more information, see Move a VM across zones or regions.
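
    For example, if you reuse an existing VM in one of the replica zones, one possible step is to attach the new regional disk to that VM. This sketch uses hypothetical names and assumes that your default zone is set:

    gcloud compute instances attach-disk my-vm \
        --disk=my-recovered-disk \
        --disk-scope=regional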

After you recover and migrate your disk data and VMs to the newly created regional Persistent Disk volume, you can resume your operations.

Determine the RPO provided by replica recovery checkpoint

This section explains how to determine the RPO provided by the latest replica recovery checkpoint of a regional Persistent Disk volume.

Zonal replicas are fully synced

Compute Engine refreshes the replica recovery checkpoint of your regional Persistent Disk volume approximately every 10 minutes. As a result, when your zonal replicas are fully synced, the RPO is approximately 10 minutes.

Zonal replicas are out of sync

You can't view the exact creation and refresh timestamps of a replica recovery checkpoint. However, you can estimate the approximate RPO that your latest checkpoint provides by using the following data:

  • Most recent timestamp of the fully replicated disk state: You can get this information by using the regional Persistent Disk Cloud Monitoring data for the replica_state metric. Check the replica_state metric data for the out-of-sync replica to determine when the replica went out of sync. Because Compute Engine refreshes the disk's checkpoint approximately every 10 minutes, the most recent checkpoint refresh could have been up to approximately 10 minutes before this timestamp.
  • Most recent write operation timestamp: You can get this information by using the Persistent Disk Cloud Monitoring data for the write_ops_count metric. Check the write_ops_count metric data to determine the most recent write operation for the disk.

After you determine these timestamps, use the following formula to calculate the approximate RPO provided by the replica recovery checkpoint of your disk. If the calculated value is less than zero, then the RPO is effectively zero.

Approximate RPO provided by the latest checkpoint = (Most recent write operation timestamp - (Most recent timestamp of the fully replicated disk state - 10 minutes))
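
For example, the following bash sketch applies this formula to two hypothetical timestamps. It assumes GNU date, which is available on most Linux VMs:

# Hypothetical timestamps taken from the Monitoring metrics described earlier.
LAST_FULLY_REPLICATED="2024-05-01 10:30:00 UTC"   # most recent fully replicated disk state
LAST_WRITE="2024-05-01 10:45:00 UTC"              # most recent write operation

# Approximate RPO = last write - (last fully replicated - 10 minutes)
RPO_SECONDS=$(( $(date -d "$LAST_WRITE" +%s) - ( $(date -d "$LAST_FULLY_REPLICATED" +%s) - 600 ) ))
echo "Approximate RPO: $(( RPO_SECONDS / 60 )) minutes"   # 25 minutes in this example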

What's next