Be a Regional Persistent Disk monitoring superhero: How to know when you’re at RPO=0
Michael Ng
Product Manager
Wouldn’t it be wonderful if you were always in the know about how your mission-critical workload was doing on Google Cloud? Perhaps you are a compliance officer in charge of regulatory compliance for your applications, or a cloud administrator who cares deeply about application observability and wants those applications to run smoothly in the cloud.
At Google Cloud, we built Regional Persistent Disk with mission-critical workloads in mind: it provides high availability by synchronously replicating writes (RPO = 0) across two Google Cloud availability zones. As we continue to innovate on high availability, we are excited to introduce two new capabilities: the Regional Persistent Disk Replication State Dashboard and the Replica State Cloud Monitoring metric, which help you monitor and gain insight into your Regional Persistent Disk’s replication state.
Benefits of monitoring Regional Persistent Disk replication states
- Monitoring for high-availability compliance audits: Regional Persistent Disks are used extensively as primary storage for mission-critical workloads (such as MySQL, SQL Server, and Elasticsearch) with strict high-availability compliance goals tied to them. Compliance audits are conducted regularly to verify that the workload and its underlying infrastructure meet the availability and resilience standards these goals require. Without the ability to periodically monitor the current and historical replication state of Regional Persistent Disks, Google Cloud users have found it challenging to prove that their applications and workloads are compliant with their goals.
- Monitoring for high-availability compliance maintainability: In addition to periodically monitoring replication state for compliance reporting, Google Cloud users also want to continuously maintain high-availability and replication standards for their application and its data stored in Regional Persistent Disks. Maintaining replication standards can be challenging, as users must constantly inspect replication state to ensure that standards are being met. A proactive alerting mechanism that triggers when the replication state changes to a state that impacts compliance would greatly streamline the process of compliance maintainability.
Let’s take a deep dive into how Regional Persistent Disks enable these benefits for your mission-critical workloads.
Replication States: A quick guide
Before diving deeper into how Regional Persistent Disk enables these benefits, it is useful to take a quick pit stop to understand the different replication states and why they matter to you. Regional Persistent Disks synchronously replicate data to two replicas in different Google Cloud zones in the same region. Depending on the state of the individual replicas, your Regional Persistent Disk volume can be in one of the following replication states:
- Fully replicated: Replicas in both zones are available, and are fully replicating at RPO=0. In this state, users experience no data loss if a zonal issue requires their virtual machine to be failed over to the other replicated zone. For organizations with a strict high-availability compliance goal, this is the best replication state to be in.
- Degraded: One of the replicas is offline and data is not being replicated between the two replicas. In this state, availability of the disk will likely be affected if a zonal error occurs on the remaining replica. The disk will typically not be in this state for very long, because the Persistent Disk platform actively self-heals to get back to a fully replicated state as soon as possible. To avoid potential exposure to data unavailability due to a failure of the remaining replica, it is best to enable snapshots or Persistent Disk Asynchronous Replication.
- Catching up: If replication can self-heal, a Regional Persistent Disk moves from degraded to catching up and finally to fully replicated. This state is a useful precursor that tells you the disk is most likely working toward being fully replicated.
For more details on replication states, please see the replication state public documentation.
Monitoring for compliance audits
The current and historical Regional Persistent Disk replication status can be observed from the Regional Persistent Disk Replication State Dashboard, located in the Google Cloud console for all attached Regional Persistent Disks:
Figure 1: The Regional Persistent Disk Replication State Dashboard in the Google Cloud console
With this dashboard, you can see the replication state of both replicas of a Regional Persistent Disk. A value of 1 indicates that a replica is fully synced with the replica in the other zone, while a value of 0 indicates that the replica is not in sync with the other replica. Both replicas need to be in sync for the Regional Persistent Disk to be fully replicated. This dashboard lets you quickly view the current and historical replication status of a Regional Persistent Disk for high-availability compliance audits and reports.
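The rule described above (both replicas must report 1 for the disk to be fully replicated) can be sketched in a few lines of Python. This is only an illustration: the zone names and values are hypothetical sample data, and the function is not part of any Google Cloud API.

```python
from typing import Mapping

IN_SYNC = 1      # replica is fully synced with the replica in the other zone
OUT_OF_SYNC = 0  # replica is not in sync with the other replica


def is_fully_replicated(replica_values: Mapping[str, int]) -> bool:
    """Return True only when both replicas report the in-sync value."""
    if len(replica_values) != 2:
        raise ValueError("a Regional Persistent Disk has exactly two replicas")
    return all(value == IN_SYNC for value in replica_values.values())


# Hypothetical zone names and values, for illustration only.
healthy = {"us-central1-a": 1, "us-central1-b": 1}
impaired = {"us-central1-a": 1, "us-central1-b": 0}

print(is_fully_replicated(healthy))   # True
print(is_fully_replicated(impaired))  # False
```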
For more complex auditing, you can build a more detailed view of each replica’s state by creating your own custom monitoring view with Cloud Monitoring’s Metrics Explorer and the Cloud Monitoring metric called “Regional Disk Replica State”. This metric records replication state information of each replica in 60-day time windows, so you can gain insights such as:
- How long has my Regional Persistent Disk been in a degraded replication state? By extension, the metric can tell you how long you have been compliant with your high-availability compliance goals.
- Which replica is out of sync, causing my Regional Persistent Disk to be in a degraded state? This information is useful for troubleshooting and remediating replication issues with the disk.
Figure 2: Diving deeper into each replica using the Metrics Explorer for a single Regional Persistent Disk and a custom date range
For more information about this metric, see the Google Cloud Regional Disk Replica State documentation.
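As a rough sketch of how the two questions above could be answered once the per-replica values have been pulled out of Cloud Monitoring, the Python below works on plain (timestamp, value) pairs. Everything here is an assumption made for illustration: the replica names, the one-minute sampling interval, and the sample data are hypothetical, and the code does not call the Cloud Monitoring API.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Sample = Tuple[datetime, int]  # (timestamp, replica state: 1 in sync, 0 not)


def out_of_sync_duration(samples: List[Sample], interval: timedelta) -> timedelta:
    """Approximate time out of sync: each 0 sample counts for one interval."""
    return interval * sum(1 for _, value in samples if value == 0)


def out_of_sync_replicas(series: Dict[str, List[Sample]]) -> List[str]:
    """Names of replicas whose most recent sample is 0."""
    names = []
    for name, samples in series.items():
        if samples and max(samples, key=lambda s: s[0])[1] == 0:
            names.append(name)
    return sorted(names)


# Hypothetical data: five samples per replica, one minute apart (assumed).
start = datetime(2024, 1, 1)
interval = timedelta(minutes=1)
series = {
    "zone-a": [(start + i * interval, v) for i, v in enumerate([1, 1, 1, 1, 1])],
    "zone-b": [(start + i * interval, v) for i, v in enumerate([1, 1, 0, 0, 0])],
}

print(out_of_sync_duration(series["zone-b"], interval))  # 0:03:00
print(out_of_sync_replicas(series))                      # ['zone-b']
```

Counting samples only approximates the duration. Its accuracy depends on how often the metric is actually sampled, which this sketch assumes and does not look up.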
Since Regional Disk Replica State is a Cloud Monitoring metric, it can be incorporated into many other monitoring and observability capabilities offered by Cloud Monitoring for more extensive audits:
- Create dashboards that combine it with other official Cloud Monitoring metrics, such as compute and networking metrics, for a comprehensive view.
- Customize your audits with more flexible metric analysis using the Cloud Monitoring Metrics Explorer.
- Export the metric to external monitoring tools like Grafana and Prometheus to integrate Regional Persistent Disk replication status into your organization’s observability tools.
Monitoring for compliance maintainability — proactive alerting
Figure 3: Setting up a proactive alerting policy using the Regional Disk Replica State metric
An additional benefit of the Regional Disk Replica State metric is that it can be integrated into Cloud Monitoring alerting policies. This allows for flexible, proactive alerting through a diverse set of notification channels, including SMS, Slack, and PagerDuty, so that alerts fit your organization’s infrastructure alerting needs.
You can set up alerts for changes in Regional Persistent Disk replication status, such as when a disk is in an unreplicated state, or when individual replicas switch state, making it easy to track and maintain compliance of high availability goals. For more details on enabling an alerting policy with the Regional Disk Replica State metric, see the Cloud Monitoring Alert Policy documentation.
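To make the two alert conditions mentioned above more concrete (a replica switching state, and a replica staying out of sync), here is a small Python sketch of the underlying logic. It is not how Cloud Monitoring alerting policies are configured. The function names, the sample history, and the three-sample threshold are all hypothetical, chosen only to show what such a condition evaluates.

```python
from typing import Iterable, List, Tuple


def state_changes(values: Iterable[int]) -> List[Tuple[int, int, int]]:
    """Return (index, previous, current) for each sample where the value changes."""
    changes = []
    previous = None
    for index, current in enumerate(values):
        if previous is not None and current != previous:
            changes.append((index, previous, current))
        previous = current
    return changes


def should_alert(values: List[int], min_consecutive: int) -> bool:
    """True when the last `min_consecutive` samples are all 0 (out of sync)."""
    if min_consecutive <= 0 or len(values) < min_consecutive:
        return False
    return all(v == 0 for v in values[-min_consecutive:])


# Hypothetical replica history: in sync, out of sync for three samples, recovered.
history = [1, 1, 0, 0, 0, 1]

print(state_changes(history))          # [(2, 1, 0), (5, 0, 1)]
print(should_alert(history[:5], 3))    # True
print(should_alert(history, 3))        # False
```

Requiring several consecutive out-of-sync samples before alerting is one way to avoid paging on very brief transitions. Whether that trade-off suits a given compliance goal is a judgment call.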
Use it today
If you’re interested in using the Regional Persistent Disk Replication State Dashboard and the Regional Disk Replica State metric to gain insight into your Regional Persistent Disk high-availability compliance, the dashboard and metric are available today for all projects with attached Regional Persistent Disks. For more information on how to get started, please visit the Regional Disk Replication Monitoring documentation page.