Availability best practices

This page describes best practices for ensuring high availability for your Google Distributed Cloud Edge connected installation. Distributed Cloud connected does not offer a service level agreement (SLA) and only provides the service level objective (SLO) described on this page.

Choose and implement the level of availability

You must choose the level of availability for your Distributed Cloud connected workloads that best suits your business requirements. For example, a self-checkout application at a retail store has a much lower availability risk than an edge RAN deployment of a mobile network carrier.

Target availability increases with the amount of spare Distributed Cloud resource capacity that you reserve for emergencies. The following table describes this relationship. These estimates exclude downtime scheduled within a maintenance window.

GDC Edge form factor                        Capacity in use  Reserved capacity  Target availability
GDC Edge Rack (single 6-machine cluster)    83.33%           16.67%             99.9%
GDC Edge Rack (single 6-machine cluster)    100%             0%                 93.5%
GDC Edge Server (single 3-machine cluster)  66.67%           33.33%             99.9%

You might experience a sudden loss of capacity due to hardware failure or a node that requires a restart. To prepare for this, you must architect your workloads with resource quotas in mind so that you always have available capacity on each Distributed Cloud connected node that meets your chosen level of availability.

For example, to achieve 99.9% target availability on a Distributed Cloud connected rack deployment, you must configure your workloads so that one of the six physical machines in each Distributed Cloud connected cluster is available as a backup.
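
The arithmetic behind the reserved-capacity figures can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the conversion from an availability percentage to hours per year is a generic calculation, not a description of how Google measures the SLO. As noted above, scheduled maintenance windows are excluded from these estimates.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def reserved_fraction(machines: int, spares: int) -> float:
    """Fraction of a cluster's machines held back as spare capacity."""
    if machines <= 0 or not 0 <= spares <= machines:
        raise ValueError("spares must be between 0 and the machine count")
    return spares / machines


def max_downtime_hours(availability_pct: float) -> float:
    """Unscheduled downtime per year implied by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


# One spare machine in a 6-machine rack cluster and in a 3-machine server cluster.
print(f"rack reserve:   {reserved_fraction(6, 1):.2%}")
print(f"server reserve: {reserved_fraction(3, 1):.2%}")

# What the two target availability levels mean over a year.
print(f"99.9% allows about {max_downtime_hours(99.9):.2f} hours of downtime per year")
print(f"93.5% allows about {max_downtime_hours(93.5):.1f} hours of downtime per year")
```

The gap between roughly 9 hours and roughly 570 hours per year is what the single spare machine buys in the rack configuration.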

Use survivability mode

Distributed Cloud lets you create clusters that use a local control plane that runs on your Distributed Cloud connected hardware. Such clusters allow workloads to continue running when the connection to Google Cloud is lost. For more information, see Distributed Cloud connected survivability mode.

Understand software updates and maintenance windows

Google regularly updates the Distributed Cloud connected software. These software updates are mandatory and you cannot opt out of them. Distributed Cloud connected lets you specify individual maintenance windows for each of your Distributed Cloud connected clusters.

To mitigate potential transient disruptions to your workloads, maintenance windows let you control when automatic upgrades of control planes and nodes can occur. Maintenance windows are useful for the following types of scenarios, among others:

  • Off-peak hours: You want to minimize the chance of downtime by scheduling automatic upgrades during off-peak hours when traffic is reduced.
  • On-call: You want to ensure that upgrades happen during working hours so that someone can monitor the upgrades and manage any unanticipated issues.
  • Multi-cluster upgrades: You want to roll out upgrades across multiple clusters in different regions one at a time at specified intervals.

In addition to automatic upgrades, Google might occasionally need to perform other maintenance tasks. In those cases, it honors a cluster's maintenance window when possible.

If tasks run beyond the maintenance window, Distributed Cloud connected attempts to pause the tasks. It then attempts to resume those tasks during the next maintenance window.

Distributed Cloud connected reserves the right to roll out unplanned emergency upgrades outside of maintenance windows. Additionally, mandatory upgrades from deprecated or outdated software might automatically occur outside of maintenance windows.

You can also manually upgrade your cluster at any time. Manually initiated upgrades begin immediately and ignore any maintenance windows.

To learn how to set up a maintenance window for a new or existing cluster, see Configure a maintenance window.

Restrictions

Maintenance windows have the following restrictions:

  • One maintenance window per cluster. You can only configure a single maintenance window per cluster. Configuring a new maintenance window overwrites the previous one.

  • Time zones for maintenance windows. When configuring and viewing maintenance windows, times are shown differently depending on the tool that you are using, as detailed in the following sections.

When configuring maintenance windows

When you use the more generic --maintenance-window flag to configure a maintenance window, you cannot specify a time zone. The Google Cloud CLI and the API display times in UTC, and the Google Cloud console displays times in your local time zone.

When you use more granular flags, such as --maintenance-window-start, you can specify the time zone as part of the value. If you omit the time zone, your local time zone is used. Times are always stored in UTC.

When viewing maintenance windows

When you view information about your cluster, timestamps for maintenance windows can be shown in UTC or in your local time zone, depending on how you are viewing the information:

  • When you use the Google Cloud console to view information about your cluster, times are always displayed in your local time zone.
  • When you use the gcloud CLI to view information about your cluster, times are always shown in UTC.

In both cases, the RRULE is always in UTC. This means that if you specify, for example, days of the week, those days are interpreted in UTC.
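
Because the RRULE is evaluated in UTC, a window that starts on Monday evening in a time zone west of UTC can fall on Tuesday in the rule. The following Python sketch illustrates the shift. The fixed UTC-8 offset, the example date, and the helper names are assumptions chosen for illustration. The two-letter day codes follow the iCalendar RRULE convention.

```python
from datetime import datetime, timedelta, timezone

# RRULE BYDAY codes, indexed by datetime.weekday() (Monday is 0).
BYDAY_CODES = ["MO", "TU", "WE", "TH", "FR", "SA", "SU"]


def to_utc(local_dt: datetime) -> datetime:
    """Convert a timezone-aware datetime to UTC."""
    if local_dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return local_dt.astimezone(timezone.utc)


def rrule_byday(local_dt: datetime) -> str:
    """Day code to use in a UTC-based RRULE for a local start time."""
    return BYDAY_CODES[to_utc(local_dt).weekday()]


# Assumed example: a window starting Monday 8 January 2024 at 20:00, UTC-8.
local_tz = timezone(timedelta(hours=-8))
local_start = datetime(2024, 1, 8, 20, 0, tzinfo=local_tz)

print(BYDAY_CODES[local_start.weekday()])  # local day: MO
print(rrule_byday(local_start))            # day in UTC: TU
print(to_utc(local_start).isoformat())     # 2024-01-09T04:00:00+00:00
```

A fixed offset keeps the example self-contained. For a real location with daylight saving time, the same conversion would use a named time zone instead.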

Configure cluster maintenance windows

Distributed Cloud connected lets you specify a maintenance window for each of your Distributed Cloud connected clusters. This window tells Google to only update the Distributed Cloud software during the time and at the frequency that you specify.

The following rules govern Distributed Cloud connected cluster maintenance windows:

  • If you specify a maintenance window for a Distributed Cloud connected cluster, Google updates your Distributed Cloud connected software 48 hours after the update has been announced through the Distributed Cloud connected release notes. On the release notes page, you can subscribe to the Distributed Cloud connected release notes RSS feed to stay informed about software updates as they are released.
  • The minimum length of a maintenance window is six hours. You can specify a longer window based on the complexity of your Distributed Cloud connected installation and your business requirements.
  • The minimum frequency of software updates is once per week. You can specify either weekly or daily maintenance windows. You can include and exclude specific days.
  • You can change the maintenance window schedule for a cluster at any time, except when a maintenance window has already been scheduled or when a maintenance window is in progress.
  • If the software update does not complete within the specified time window, it pauses and then resumes during the next scheduled maintenance window.
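
The following Python sketch shows one way to reason about these rules when planning a schedule. It is a planning aid under stated assumptions, not a description of Google's scheduler: the function names are hypothetical, the 48-hour rule is read as "the update can start in the first window that opens at least 48 hours after the announcement", and the update duration is a value you would have to estimate yourself.

```python
from datetime import datetime, timedelta, timezone
from math import ceil

MIN_WINDOW_LENGTH = timedelta(hours=6)   # minimum window length from the rules above
MAX_INTERVAL = timedelta(weeks=1)        # windows must recur at least weekly
NOTICE_PERIOD = timedelta(hours=48)      # delay after the release-notes announcement


def first_eligible_window(announced: datetime, first_start: datetime,
                          length: timedelta, interval: timedelta) -> datetime:
    """Start of the first recurring window at least 48 hours after the announcement."""
    if length < MIN_WINDOW_LENGTH:
        raise ValueError("maintenance window must be at least six hours long")
    if not timedelta(0) < interval <= MAX_INTERVAL:
        raise ValueError("maintenance windows must recur at least once per week")
    earliest = announced + NOTICE_PERIOD
    start = first_start
    while start < earliest:
        start += interval
    return start


def windows_needed(update_duration: timedelta, length: timedelta) -> int:
    """Number of windows an update spans if it pauses and resumes at window edges."""
    return ceil(update_duration / length)


# Assumed example: weekly six-hour window on Saturdays at 02:00 UTC.
announced = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)    # a Wednesday
first_start = datetime(2024, 1, 6, 2, 0, tzinfo=timezone.utc)    # a Saturday
window = timedelta(hours=6)

print(first_eligible_window(announced, first_start, window, timedelta(weeks=1)))
print(windows_needed(timedelta(hours=9), window))
```

With these example values, the first eligible window opens on Saturday 13 January, and a nine-hour update would spread across two windows, which with a weekly schedule means it finishes a week after it starts. A longer window or a daily schedule shortens that gap.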

For detailed instructions, see Configure a maintenance window for a cluster.

Repair of failed hardware

When Google detects a failure of the Distributed Cloud connected hardware, Google attempts to schedule a site visit within three business days. For a Google-authorized technician to perform the necessary diagnosis and repairs, you must grant them access to the Distributed Cloud connected hardware.

If a failure of Distributed Cloud connected hardware occurs, one of the following scenarios applies depending on whether your Distributed Cloud connected hardware uses Self-Encrypting Disk (SED) storage:

  • Distributed Cloud connected racks store data on non-SED drives. When Google performs on-site repairs, all disk drives are removed from the affected Distributed Cloud connected machine before servicing begins and are placed in your custody for the duration of the repair.

  • Distributed Cloud connected servers store data on SED drives. When a machine fails, Google replaces the entire machine. Before the machine is removed from your premises, Google ensures that your data has been securely wiped from all of its drives.

Other points of failure

You are responsible for maintaining the following aspects of your Distributed Cloud installation that are outside of Google's control and can affect the availability of Distributed Cloud connected:

  • Any and all data that you choose to store on Distributed Cloud connected hardware. This includes functioning redundant backups and the export of your data before returning your Distributed Cloud connected hardware to Google.
  • Electrical power supply.
  • Ambient temperature, humidity, and cooling.
  • Physical hardware security.
  • Local network security.
  • Local network and internet connectivity:
    • For cloud control plane clusters, Distributed Cloud connected requires a constant connection to Google Cloud and cannot function without it.
    • For local control plane clusters, Distributed Cloud connected must reconnect to Google Cloud every 7 days to refresh security tokens and encryption keys and to synchronize logging and management data.
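
For local control plane clusters, the 7-day reconnection requirement can be tracked as a simple countdown. The following Python sketch assumes you record the time of the last successful connection to Google Cloud yourself. The function names and the example timestamps are hypothetical, and the sketch does not model what happens to the cluster once the deadline passes.

```python
from datetime import datetime, timedelta, timezone

RECONNECT_INTERVAL = timedelta(days=7)  # from the requirement above


def reconnect_deadline(last_sync: datetime) -> datetime:
    """Latest time by which the cluster must reconnect to Google Cloud."""
    return last_sync + RECONNECT_INTERVAL


def time_remaining(last_sync: datetime, now: datetime) -> timedelta:
    """Time left before the deadline, or zero if it has already passed."""
    return max(reconnect_deadline(last_sync) - now, timedelta(0))


# Assumed example timestamps, both in UTC.
last_sync = datetime(2024, 3, 1, 0, 0, tzinfo=timezone.utc)
now = datetime(2024, 3, 6, 12, 0, tzinfo=timezone.utc)

print(reconnect_deadline(last_sync))  # 2024-03-08 00:00:00+00:00
print(time_remaining(last_sync, now))  # 1 day, 12:00:00
```

An alert that fires well before the remaining time reaches zero gives you room to restore connectivity during an extended outage.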

What's next