Replicate Persistent Disk volumes


This document describes how virtual machine (VM) instances access persistent disks, how persistent disk replication works, and the core infrastructure behind persistent disks. It is intended for Google Cloud engineers and architects who want to use persistent disks in their systems.

Persistent disks are not local disks attached to physical machines; rather, they are network storage services attached to VMs as network block devices. When you read from or write to a persistent disk, data is transmitted over the network. Although persistent disks are network storage devices, they provide capacity, flexibility, and reliability that conventional disks cannot.

Persistent disks and Colossus

Persistent disks are designed to run in tandem with Colossus, Google's distributed file system, which acts as the storage backend. The persistent disk driver automatically encrypts data on the VM before it is transmitted from the VM onto the network, and Colossus then persists the data. When the VM reads the data, the driver decrypts the incoming data.

Persistent disks use Colossus for the storage backend.

Having disks as a service is useful in a number of cases. For example:

  • Resizing disks becomes easier: you can increase the disk size while the VM is running, without stopping the VM first (see the example after this list).
  • Attaching and detaching disks becomes easier because disks and VMs don't have to share the same lifecycle or be co-located. For example, you can stop a VM and use its persistent boot disk to boot another VM.
  • High availability features like replication become easier because the disk driver can hide replication details and provide automatic write-time replication.
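
The following commands are a minimal sketch of these operations using the gcloud CLI; the disk name (my-data-disk), VM names (vm-1, vm-2), and zone are hypothetical placeholders:

# Grow the disk while the VM that uses it keeps running.
$ gcloud compute disks resize my-data-disk --size=200GB --zone=us-west1-a

# Detach the disk from one VM and attach it to another; the disk's
# lifecycle is independent of either VM.
$ gcloud compute instances detach-disk vm-1 --disk=my-data-disk --zone=us-west1-a
$ gcloud compute instances attach-disk vm-2 --disk=my-data-disk --zone=us-west1-a

After resizing the disk, you still need to grow the file system inside the VM before the extra space is usable (for example, with resize2fs for ext4).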

Disk latency

There are various benchmarking tools that you can use to measure the latency overhead of using disks as a network service. The following example uses the SCSI disk interface (not the NVMe interface) and shows the output of a VM performing a few 4 KiB reads from a persistent disk. The output shows an example of the latency that you see for the reads:

$ ioping -c 5 /dev/sda1
4 KiB <<< /dev/sda1 (block device 10.00 GiB): time=293.7 us (warmup)
4 KiB <<< /dev/sda1 (block device 10.00 GiB): time=330.0 us
4 KiB <<< /dev/sda1 (block device 10.00 GiB): time=278.1 us
4 KiB <<< /dev/sda1 (block device 10.00 GiB): time=307.7 us
4 KiB <<< /dev/sda1 (block device 10.00 GiB): time=310.1 us
--- /dev/sda1 (block device 10.00 GiB) ioping statistics ---
4 requests completed in 1.23 ms, 16 KiB read, 3.26 k iops, 12.7 MiB/s
generated 5 requests in 4.00 s, 20 KiB, 1 iops, 5.00 KiB/s
min/avg/max/mdev = 278.1 us / 306.5 us / 330.0 us / 18.6 us

Compute Engine also allows you to attach local SSDs to virtual machines in cases where you need the lowest possible latency. When you run a cache server or large data processing jobs that produce intermediate output, we recommend that you choose local SSDs. Unlike persistent disks, data on local SSDs is not persistent: the VM clears the data each time the VM restarts. As a result, local SSDs are only suitable for optimization use cases.
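
As a sketch, a VM with an NVMe local SSD like the one benchmarked below might be provisioned as follows; the instance name, machine type, and zone are hypothetical placeholders, and each --local-ssd flag attaches one 375 GB device:

# Create a VM with a local SSD attached over the NVMe interface.
$ gcloud compute instances create cache-vm \
    --zone=us-west1-a \
    --machine-type=n1-standard-4 \
    --local-ssd=interface=NVME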

The following output is an example of the latency that you see with 4 KiB reads from a local SSD using the NVMe disk interface:

$ ioping -c 5 /dev/nvme0n1
4 KiB <<< /dev/nvme0n1 (block device 375 GiB): time=245.3 us (warmup)
4 KiB <<< /dev/nvme0n1 (block device 375 GiB): time=252.3 us
4 KiB <<< /dev/nvme0n1 (block device 375 GiB): time=244.8 us
4 KiB <<< /dev/nvme0n1 (block device 375 GiB): time=289.5 us
4 KiB <<< /dev/nvme0n1 (block device 375 GiB): time=219.9 us
--- /dev/nvme0n1 (block device 375 GiB) ioping statistics ---
4 requests completed in 1.01 ms, 16 KiB read, 3.97 k iops, 15.5 MiB/s
generated 5 requests in 4.00 s, 20 KiB, 1 iops, 5.00 KiB/s
min/avg/max/mdev = 219.9 us / 251.6 us / 289.5 us / 25.0 us

Replication

When you create a new persistent disk, you can either create the disk in one zone or replicate it across two zones within the same region.

For example, if you create a disk in a single zone, such as us-west1-a, you have one copy of the disk; this is referred to as a zonal disk. You can increase the disk's availability by storing another copy of the disk in a different zone within the same region, such as us-west1-b.

Disks replicated across two zones in the same region are called regional persistent disks.
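
The following gcloud commands are a minimal sketch of the difference; the disk names and size are hypothetical placeholders:

# Zonal disk: a single copy of the data in us-west1-a.
$ gcloud compute disks create zonal-disk --size=200GB --zone=us-west1-a

# Regional persistent disk: data replicated across us-west1-a and us-west1-b.
$ gcloud compute disks create regional-disk --size=200GB \
    --region=us-west1 --replica-zones=us-west1-a,us-west1-b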

It is unlikely for an entire region to fail, but zonal failures can happen. Replicating the disk across two zones within the region, as shown in the following image, improves availability. If both replica zones fail, it is considered a region-wide failure.

The disk is replicated in two zones.

In the replicated scenario, the data is available in the local zone (us-west1-a), which is the zone where the VM is running. The data is also replicated to another Colossus instance in a second zone (us-west1-b). At least one of the replica zones should be the zone in which the VM is running.

Note that persistent disk replication provides high availability for the disks only. A zonal outage might also affect the VMs or other components, which can cause outages of their own.
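
For example, if the zone that the VM runs in fails, one recovery path is to force-attach the regional disk to a standby VM in the other replica zone. The following command is a sketch of that step, assuming a standby VM already exists in us-west1-b; the instance name (standby-vm) and disk name (regional-disk) are hypothetical placeholders:

# Force-attach the regional disk to a standby VM in the surviving zone.
$ gcloud compute instances attach-disk standby-vm \
    --disk=regional-disk --disk-scope=regional \
    --force-attach --zone=us-west1-b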

Read/write sequences

The disk driver in your VM does most of the work of determining the read/write sequence, that is, the order in which data is read from and written to disk. As a user, you don't have to deal with replication semantics and can interact with the file system as usual; the underlying driver handles the sequencing of reads and writes.

By default, the system operates in full replication mode, where requests to read or write from disk are sent to both replicas.

In full replication mode, the following occurs:

  • When writing, the driver sends the write request to both replicas and acknowledges the write only after both replicas succeed.
  • When reading, the driver sends the read request to both replicas and returns the results from the replica that succeeds. If the read request times out, another read request is sent.

If a replica falls behind and fails to acknowledge that read or write requests have completed, reads and writes are no longer sent to that replica. The replica must go through a reconciliation process to bring it back to a current state before replication to it can resume.

What's next