Availability best practices

This page describes best practices for ensuring high availability for your Google Distributed Cloud Edge installation. Distributed Cloud Edge does not offer a service level agreement (SLA) and only provides the service level objective (SLO) described on this page.

Choose and implement the level of availability

You must choose the level of availability for your Distributed Cloud Edge workloads that best suits your business requirements. For example, a self-checkout application at a retail store typically has less stringent availability requirements than an edge RAN deployment of a mobile network carrier.

Target availability depends on the amount of spare Distributed Cloud Edge resource capacity that you reserve for emergencies: the more capacity you reserve, the higher the target availability. The following table describes this relationship. These estimates do not include downtime scheduled within a maintenance window.

GDC Edge form factor | Capacity in use | Reserved capacity | Target availability
GDC Edge Rack (single 6-machine cluster) | 83.33% | 16.67% | 99.9%
GDC Edge Rack (single 6-machine cluster) | 100% | 0% | 93.5%
GDC Edge Server (single 3-machine cluster) | 66.6% | 33.3% | 99.9%
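
To make the targets in the table above more concrete, the following minimal sketch converts an availability percentage into the unplanned downtime it would allow. It assumes availability is measured over a 365-day year, which this page does not state; the function name allowed_downtime_hours is illustrative only.

```python
# Assumption: availability is measured over a non-leap year of 8760 hours.
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime, in hours, implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Targets taken from the table: with and without reserved capacity on a rack.
with_reserve = allowed_downtime_hours(99.9)
without_reserve = allowed_downtime_hours(93.5)
print(f"99.9% target: {with_reserve:.1f} hours per year")
print(f"93.5% target: {without_reserve:.1f} hours per year")
```

Under this assumption, 99.9% corresponds to slightly less than 9 hours of downtime per year, while 93.5% corresponds to roughly 569 hours, or about 24 days.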

You might experience a sudden loss of capacity due to hardware failure or a node that requires a restart. To prepare for this, you must architect your workloads with resource quotas in mind so that you always have available capacity on each Distributed Cloud Edge node that meets your chosen level of availability.

For example, to achieve 99.9% target availability on a Distributed Cloud Edge Rack deployment, you must configure your workloads so that one of the six physical machines in each Distributed Cloud Edge cluster is available as a backup.
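
The reserved-capacity figures follow from simple arithmetic on the number of machines kept as backups. The sketch below illustrates that arithmetic only; the function reserved_capacity is a hypothetical helper, not part of any Distributed Cloud Edge tool.

```python
def reserved_capacity(machines: int, spares: int) -> tuple[float, float]:
    """Return (capacity in use, reserved capacity) as percentages."""
    if machines <= 0 or not 0 <= spares <= machines:
        raise ValueError("spares must be between 0 and the machine count")
    reserved = 100.0 * spares / machines
    return 100.0 - reserved, reserved


# One backup machine in a 6-machine rack cluster and in a 3-machine server cluster.
rack_in_use, rack_reserved = reserved_capacity(machines=6, spares=1)
server_in_use, server_reserved = reserved_capacity(machines=3, spares=1)
print(f"Rack: {rack_in_use:.2f}% in use, {rack_reserved:.2f}% reserved")
print(f"Server: {server_in_use:.2f}% in use, {server_reserved:.2f}% reserved")
```

Keeping one of six machines free reserves about 16.67% of the cluster, and keeping one of three free reserves about 33.33%, which matches the rows with a 99.9% target.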

Use survivability mode

Distributed Cloud Edge lets you create clusters that use a local control plane that runs on your Distributed Cloud Edge hardware. Such clusters allow workloads to continue running when the connection to Google Cloud is lost. For more information, see Distributed Cloud Edge survivability mode.

Understand software updates and maintenance windows

Google regularly updates the Distributed Cloud Edge software. These software updates are mandatory and you cannot opt out of them. Distributed Cloud Edge lets you specify individual maintenance windows for each of your Distributed Cloud Edge clusters.

To mitigate potential transient disruptions to your workloads, maintenance windows let you control when automatic upgrades of control planes and nodes can occur. Maintenance windows are useful for the following types of scenarios, among others:

  • Off-peak hours: You want to minimize the chance of downtime by scheduling automatic upgrades during off-peak hours when traffic is reduced.
  • On-call: You want to ensure that upgrades happen during working hours so that someone can monitor the upgrades and manage any unanticipated issues.
  • Multi-cluster upgrades: You want to roll out upgrades across multiple clusters in different regions one at a time at specified intervals.

In addition to automatic upgrades, Google might occasionally need to perform other maintenance tasks. In those cases, it honors a cluster's maintenance window when possible.

If tasks run beyond the maintenance window, Distributed Cloud Edge attempts to pause the tasks. It then attempts to resume those tasks during the next maintenance window.
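
As a rough planning aid, you can estimate how many maintenance windows a long-running task might span. The sketch below assumes that a paused task resumes exactly where it stopped and uses the full length of every window, which is a simplification; windows_needed is a hypothetical helper.

```python
import math
from datetime import timedelta


def windows_needed(work: timedelta, window: timedelta) -> int:
    """Estimate how many windows a task spans if it pauses at each window end."""
    if window <= timedelta(0):
        raise ValueError("window length must be positive")
    # Dividing two timedeltas yields a float ratio; round up to whole windows.
    return max(1, math.ceil(work / window))


# Example: a 14-hour task with a 6-hour window spills into a third window.
spans = windows_needed(timedelta(hours=14), timedelta(hours=6))
print(f"Estimated maintenance windows: {spans}")
```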

Distributed Cloud Edge reserves the right to roll out unplanned emergency upgrades outside of maintenance windows. Additionally, mandatory upgrades from deprecated or outdated software might automatically occur outside of maintenance windows.

You can also manually upgrade your cluster at any time. Manually initiated upgrades begin immediately and ignore any maintenance windows.

To learn how to set up a maintenance window for a new or existing cluster, see Configure a maintenance window.

Restrictions

Maintenance windows have the following restrictions:

  • One maintenance window per cluster. You can only configure a single maintenance window per cluster. Configuring a new maintenance window overwrites the previous one.

  • Time zones for maintenance windows. When configuring and viewing maintenance windows, times are shown differently depending on the tool that you are using, as detailed in the following sections.

When configuring maintenance windows

When you use the more generic --maintenance-window flag to configure a maintenance window, you cannot specify a time zone. When you use the Google Cloud CLI or the API, UTC is used to display times. The Google Cloud console uses the local time zone to display times.

When you use more granular flags, such as --maintenance-window-start, you can specify the time zone as part of the value. If you omit the time zone, your local time zone is used. Times are always stored in UTC.

When viewing maintenance windows

When you view information about your cluster, timestamps for maintenance windows can be shown in UTC or in your local time zone, depending on how you are viewing the information:

  • When you use the Google Cloud console to view information about your cluster, times are always displayed in your local time zone.
  • When you use the gcloud CLI to view information about your cluster, times are always shown in UTC.

In both cases, the RRULE is always in UTC. For example, if the rule specifies days of the week, those days are interpreted in UTC, not in your local time zone.
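
Because recurrence days are interpreted in UTC, a window that starts late in the evening local time can land on a different weekday in UTC. The following sketch illustrates the shift with the Python standard library. The fixed UTC-08:00 offset and the example date are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-08:00 offset stands in for the operator's local zone.
LOCAL = timezone(timedelta(hours=-8))


def to_utc(local_start: datetime) -> datetime:
    """Convert a time-zone-aware local start time to UTC."""
    if local_start.tzinfo is None:
        raise ValueError("attach a time zone before converting")
    return local_start.astimezone(timezone.utc)


# Saturday, 22:00 local time becomes Sunday, 06:00 UTC.
local_start = datetime(2024, 1, 6, 22, 0, tzinfo=LOCAL)
utc_start = to_utc(local_start)
print(local_start.isoformat(), "->", utc_start.isoformat())
print("Weekday shifts:", local_start.weekday() != utc_start.weekday())
```

In this example, a recurrence rule intended for Saturday night local time would need to name Sunday, because that is the day on which the window starts in UTC.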

Configure cluster maintenance windows

Distributed Cloud Edge lets you specify a maintenance window for each of your Distributed Cloud Edge clusters. This window tells Google to only update the Distributed Cloud Edge software during the time and at the frequency that you specify.

The following rules govern Distributed Cloud Edge cluster maintenance windows:

  • If you specify a maintenance window for a Distributed Cloud Edge cluster, Google updates your Distributed Cloud Edge software 48 hours after the update has been announced through the Distributed Cloud Edge release notes. On the release notes page, you can subscribe to the Distributed Cloud Edge release notes RSS feed to stay informed about software updates as they are released.
  • The minimum length of a maintenance window is six hours. You can specify a longer window based on the complexity of your Distributed Cloud Edge installation and your business requirements.
  • The minimum frequency of software updates is once per week. You can specify either weekly or daily maintenance windows. You can include and exclude specific days.
  • You can change the maintenance window schedule for a cluster at any time, except when a maintenance window has already been scheduled or when a maintenance window is in progress.
  • If the software update does not complete within the specified time window, it pauses and then resumes during the next scheduled maintenance window.
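
Before you submit a schedule, you can check it against the length and frequency rules above. The following sketch is one possible pre-check. It assumes the recurrence is written as an RFC 5545 style string such as FREQ=WEEKLY;BYDAY=SA; the exact format that Distributed Cloud Edge accepts is not shown on this page, and check_window is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

MIN_WINDOW = timedelta(hours=6)           # minimum window length stated above
ALLOWED_FREQUENCIES = {"DAILY", "WEEKLY"}  # windows recur daily or weekly


def check_window(start: datetime, end: datetime, rrule: str) -> list[str]:
    """Return a list of problems; an empty list means the window passes."""
    problems = []
    if end - start < MIN_WINDOW:
        problems.append("window is shorter than six hours")
    # Parse "FREQ=WEEKLY;BYDAY=SA" into {"FREQ": "WEEKLY", "BYDAY": "SA"}.
    parts = dict(p.split("=", 1) for p in rrule.split(";") if "=" in p)
    if parts.get("FREQ") not in ALLOWED_FREQUENCIES:
        problems.append("recurrence must be daily or weekly")
    return problems


start = datetime(2024, 1, 7, 2, 0, tzinfo=timezone.utc)
good = check_window(start, start + timedelta(hours=8), "FREQ=WEEKLY;BYDAY=SU")
bad = check_window(start, start + timedelta(hours=4), "FREQ=MONTHLY")
print("8-hour weekly window:", good or "ok")
print("4-hour monthly window:", bad)
```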

For detailed instructions, see Configure a maintenance window for a cluster.

Repair of failed hardware

When Google detects a failure of the Distributed Cloud Edge hardware, Google attempts to schedule a site visit within three business days. For a Google-authorized technician to perform the necessary diagnosis and repairs, you must grant them access to the Distributed Cloud Edge hardware.

If a failure of Distributed Cloud Edge hardware occurs, one of the following scenarios applies depending on whether your Distributed Cloud Edge hardware uses Self-Encrypting Disk (SED) storage:

  • Distributed Cloud Edge Racks store data on non-SED drives. When Google performs on-site repairs, all disk drives are removed from the affected Distributed Cloud Edge machine before servicing begins and are placed in your custody for the duration of the repair.

  • Distributed Cloud Edge Servers store data on SED drives. When a machine fails, Google replaces the entire machine. Before the machine is removed from your premises, Google ensures that your data has been securely wiped from all of its drives.

Other points of failure

You are responsible for maintaining the following aspects of your Distributed Cloud Edge installation that are outside of Google's control and can affect the availability of Distributed Cloud Edge:

  • Any and all data that you choose to store on Distributed Cloud Edge hardware. This includes functioning redundant backups and the export of your data before returning your Distributed Cloud Edge hardware to Google.
  • Electrical power supply.
  • Ambient temperature, humidity, and cooling.
  • Physical hardware security.
  • Local network security.
  • Local network and internet connectivity:
    • For cloud control plane clusters, Distributed Cloud Edge requires a constant connection to Google Cloud and cannot function without it.
    • For local control plane clusters, Distributed Cloud Edge must reconnect to Google Cloud every 7 days to refresh security tokens and encryption keys and to synchronize logging and management data.
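
For local control plane clusters, it can help to track how much time remains before the 7-day reconnection requirement is reached. The sketch below shows the arithmetic only. How you obtain the time of the last successful connection is outside its scope, and time_until_reconnect is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

RECONNECT_INTERVAL = timedelta(days=7)  # interval stated for local control plane clusters


def time_until_reconnect(last_connected: datetime, now: datetime) -> timedelta:
    """Return the time left before the cluster must reconnect.

    A negative result means the deadline has already passed.
    """
    return last_connected + RECONNECT_INTERVAL - now


# Example timestamps are illustrative only.
last_connected = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 6, 12, 0, tzinfo=timezone.utc)
remaining = time_until_reconnect(last_connected, now)
print(f"Time left before reconnection is required: {remaining}")
```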

What's next