Failover and failback asynchronous disks


This document describes how to fail over and fail back Persistent Disk Asynchronous Replication (PD Async Replication) disks.

In the event of an outage in the primary region, it is your responsibility to identify the outage, fail over, and restart your workload using the secondary disks in the secondary region. PD Async Replication doesn't offer outage monitoring. You can identify an outage by using RPO metrics, health checks, and application-specific metrics, and by contacting Cloud Customer Care.
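Because outage detection is left to you, it can help to combine several signals before you decide to fail over. The following is a minimal sketch, not a recommended policy: the `OutageSignals` type, the `looks_like_outage` function, the 900-second threshold, and the three-sample window are all assumptions chosen for illustration. The sketch expects that you have already collected the RPO samples and health check results by some other means.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class OutageSignals:
    # Recent time-since-last-replication samples in seconds, oldest first.
    rpo_seconds: Sequence[float]
    # Result of your own health checks against the primary region.
    health_checks_passing: bool


def looks_like_outage(
    signals: OutageSignals,
    rpo_threshold_seconds: float = 900.0,  # assumed threshold, tune for your workload
    min_consecutive: int = 3,              # assumed window size
) -> bool:
    """Return True when the RPO is stale AND health checks are failing."""
    recent = list(signals.rpo_seconds)[-min_consecutive:]
    rpo_stale = len(recent) == min_consecutive and all(
        sample > rpo_threshold_seconds for sample in recent
    )
    return rpo_stale and not signals.health_checks_passing
```

Requiring both signals reduces the chance of failing over because of a single noisy metric. Treat a positive result as a prompt to investigate, not as an automatic trigger.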

Following a failover from the primary region to the secondary region, the secondary region becomes the acting primary region.

After the outage or disaster is resolved, you can initiate failback to start replication from the original secondary region (the acting primary region) to the original primary region. You can optionally repeat the process to move the workload back to the original primary region. Moving the workload back to the original primary region isn't strictly necessary, but can be done based on disaster recovery requirements, such as locality or available resources.
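The way region roles change during failover and failback can be easier to follow as a small state model. The following sketch is illustrative only: the `Topology` type, the `fail_over` and `fail_back` functions, and the region names in the usage example are assumptions, and the code changes no real resources.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Topology:
    acting_primary: str    # region that currently serves the workload
    acting_secondary: str  # region that receives (or will receive) replicated data
    replicating: bool      # True while acting_primary replicates to acting_secondary


def fail_over(t: Topology) -> Topology:
    """Stop replication and make the secondary region the acting primary."""
    return Topology(
        acting_primary=t.acting_secondary,
        acting_secondary=t.acting_primary,
        replicating=False,
    )


def fail_back(t: Topology) -> Topology:
    """Start replication from the acting primary to the acting secondary."""
    return replace(t, replicating=True)
```

For example, starting from `Topology("us-central1", "us-east1", True)`, a `fail_over` followed by a `fail_back` leaves the workload in `us-east1` and replicating to `us-central1`. Repeating both steps returns the model to its starting state, which corresponds to optionally moving the workload back.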

To learn more about failover and failback, see About Persistent Disk Asynchronous Replication.

Failover to the secondary region

When you identify that a disaster has occurred, initiate failover to the secondary region. A failover moves the workload from the primary region to the secondary region. After the failover, the secondary disk is the acting primary disk and the secondary region is the acting primary region.

You can fail over a single disk, or all disks in a consistency group.

Single disk

To fail over a single disk, do the following:

  1. Stop disk replication.
  2. If you don't already have a VM in the same region as the secondary disk, create one.
  3. Attach the secondary disk to the VM.

    The secondary disk is now the workload's acting primary disk and the secondary region is the acting primary region.
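The three steps above can be sketched as a short orchestration function. This is a minimal sketch under stated assumptions: `DiskOps` is a hypothetical interface, not a Compute Engine client library API, and `DryRunOps` only records the calls it receives so that you can review the order of operations. In practice, you would back the interface with the tool or client library that you already use.

```python
from typing import Dict, List, Optional, Protocol, Tuple


class DiskOps(Protocol):
    """Hypothetical operations; implement these with your own tooling."""

    def stop_replication(self, disk: str) -> None: ...
    def find_vm(self, region: str) -> Optional[str]: ...
    def create_vm(self, region: str) -> str: ...
    def attach_disk(self, disk: str, vm: str) -> None: ...


class DryRunOps:
    """Records calls instead of changing any real resources."""

    def __init__(self, existing_vms: Optional[Dict[str, str]] = None) -> None:
        self.vms: Dict[str, str] = dict(existing_vms or {})  # region -> VM name
        self.calls: List[Tuple[str, ...]] = []

    def stop_replication(self, disk: str) -> None:
        self.calls.append(("stop_replication", disk))

    def find_vm(self, region: str) -> Optional[str]:
        return self.vms.get(region)

    def create_vm(self, region: str) -> str:
        name = f"vm-{region}"  # placeholder naming scheme
        self.vms[region] = name
        self.calls.append(("create_vm", region))
        return name

    def attach_disk(self, disk: str, vm: str) -> None:
        self.calls.append(("attach_disk", disk, vm))


def fail_over_disk(ops: DiskOps, secondary_disk: str, secondary_region: str) -> str:
    """Run the single-disk failover steps in order and return the VM name."""
    ops.stop_replication(secondary_disk)                                # step 1
    vm = ops.find_vm(secondary_region) or ops.create_vm(secondary_region)  # step 2
    ops.attach_disk(secondary_disk, vm)                                 # step 3
    return vm
```

Keeping the operations behind an interface lets you rehearse the failover sequence without touching production disks.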

Consistency group

To fail over a consistency group, do the following:

  1. Stop consistency group replication.
  2. If you don't already have VMs in the same region as the secondary disks, create them.
  3. Attach the secondary disks to the VMs.
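For a consistency group, it can be useful to generate the full list of operations before you run any of them. The following sketch builds such a plan as plain data. The function name, the step names, and the disk-to-VM mapping are assumptions for illustration, and the plan isn't tied to any particular CLI or API.

```python
from typing import Dict, List, Set, Tuple

Step = Tuple[str, ...]


def plan_group_failover(
    group: str,
    disk_to_vm: Dict[str, str],  # secondary disk name -> VM that should mount it
    existing_vms: Set[str],      # VMs already present in the secondary region
) -> List[Step]:
    """Return the ordered steps for failing over a consistency group."""
    # Step 1: stop replication once for the whole group.
    steps: List[Step] = [("stop_group_replication", group)]
    # Step 2: create only the VMs that don't exist yet.
    for vm in sorted(set(disk_to_vm.values()) - existing_vms):
        steps.append(("create_vm", vm))
    # Step 3: attach every secondary disk to its VM.
    for disk, vm in sorted(disk_to_vm.items()):
        steps.append(("attach_disk", disk, vm))
    return steps
```

Sorting the VM and disk names keeps the plan deterministic, which makes it easier to review and to compare between rehearsals.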

Failback to the original primary region

After the disaster is resolved, initiate a failback to the original primary region. A failback configures and starts replication from the acting primary disk to a new secondary disk in the acting secondary region.

You can fail back a single disk, or all disks in a consistency group.

Single disk

To fail back a single disk, do the following:

  1. Create a secondary disk in the acting secondary region. The acting secondary region is the original primary region.
  2. Start replication from the acting primary disk to the new secondary disk.
  3. Optional: Move the workload from the acting primary region to the original primary region by doing the following:

    1. Wait for the initial replication to complete. The initial replication is complete when the disk/async_replication/time_since_last_replication metric is available in Cloud Monitoring. If you don't see the RPO metric in Cloud Explorer, that means the initial replication isn't complete.
    2. Recommended: To avoid data loss, schedule downtime for the workload and bring the workload offline.
    3. Stop replication.
    4. Attach the secondary disk to a VM.

      The secondary disk is now the workload's primary disk in the original primary region.

    5. Reconfigure replication in the original primary region by doing the following:

      1. Create a new secondary disk in the original secondary region.
      2. Start replication from the primary disk to the new secondary disk.
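Waiting for the initial replication to complete is a polling problem: you check whether the RPO metric has any data, and you retry until it does or until you give up. The following is a minimal sketch. The `read_rpo` callable is hypothetical and is expected to return the latest metric value, or `None` when no data points exist yet. How you query Cloud Monitoring is not shown, and the timeout and polling interval are assumed defaults.

```python
import time
from typing import Callable, Optional


def wait_for_initial_replication(
    read_rpo: Callable[[], Optional[float]],  # hypothetical metric reader
    timeout_seconds: float = 3600.0,          # assumed default
    poll_seconds: float = 60.0,               # assumed default
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once the RPO metric reports data, or False on timeout."""
    deadline = clock() + timeout_seconds
    while True:
        if read_rpo() is not None:
            return True   # metric is available: initial replication is complete
        if clock() >= deadline:
            return False  # still no data: don't stop replication yet
        sleep(poll_seconds)
```

The `sleep` and `clock` parameters exist so that you can test the function without real delays. If the function returns `False`, don't continue to the step that stops replication.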

Consistency group

To fail back a consistency group, do the following:

  1. Create a new consistency group in the acting primary region. The acting primary region is the original secondary region.
  2. Add the acting primary disks to the consistency group.
  3. Create secondary disks in the acting secondary region that reference the acting primary disks.
  4. Start replication.
  5. Optional: Move the workload from the acting primary region to the original primary region by doing the following:

    1. Wait for the initial replication to complete. The initial replication is complete when the RPO metric (disk/async_replication/time_since_last_replication) is available in Cloud Monitoring. If you don't see the RPO metric in Cloud Explorer, that means the initial replication isn't complete.
    2. Recommended: To avoid data loss, schedule downtime for the workload and bring the workload offline.
    3. Stop replication.
    4. Attach the secondary disks to the VMs.

      The secondary disks are now the workload's primary disks in the original primary region.

    5. Reconfigure replication in the original primary region by doing the following:

      1. Add the primary disks to the original consistency group.
      2. Create new secondary disks in the original secondary region.
      3. Start replication.
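Before you stop replication to move the workload back, you might want to run a short preflight check that reflects the first two optional steps above. The following sketch treats incomplete initial replication as a blocker and an online workload as a warning, because bringing the workload offline is recommended rather than required. The function name and message strings are assumptions for illustration.

```python
from typing import List, Tuple


def cutover_preflight(
    rpo_metric_available: bool,  # True when the RPO metric has data
    workload_offline: bool,      # True when the workload has been stopped
) -> Tuple[List[str], List[str]]:
    """Return (blockers, warnings) to review before stopping replication."""
    blockers: List[str] = []
    warnings: List[str] = []
    if not rpo_metric_available:
        blockers.append("initial replication is not complete")
    if not workload_offline:
        warnings.append("workload is online; stopping replication risks data loss")
    return blockers, warnings
```

Continue with the move only when the list of blockers is empty, and review any warnings with the workload owners first.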

What's next