Incidents and the Google Cloud Service Health Dashboard

The Google Cloud Service Health (CSH) Dashboard provides status information for Google Cloud services, organized by region and global locale.

Major incident

The impact of a major incident extends to two or more of the following scenarios:

  • Multiple services
  • Multiple regions
  • Multiple hours
  • Multiple customers
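
The "two or more" rule above can be sketched in a few lines. This is only an illustration: the `Impact` class, its field names, and the reading of "multiple" as "more than one" are assumptions made for this example, not Google's actual classification logic.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    """Hypothetical summary of an incident's scope (illustrative only)."""
    services: int
    regions: int
    hours: float
    customers: int


def is_major(impact: Impact) -> bool:
    # Assumption: "multiple" is read as "more than one" for every criterion.
    criteria = [
        impact.services > 1,
        impact.regions > 1,
        impact.hours > 1,
        impact.customers > 1,
    ]
    # A major incident meets two or more of the four criteria.
    return sum(criteria) >= 2


# Two criteria met (services and regions): classified as major.
print(is_major(Impact(services=2, regions=3, hours=0.5, customers=1)))
# Only one criterion met (hours): not classified as major.
print(is_major(Impact(services=1, regions=1, hours=4, customers=1)))
```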

In the rare event that a major incident occurs, we act with urgency to resolve it.

During a major incident, the status of the issue is communicated through the Google Workspace Status Dashboard and the Google Cloud Service Health Dashboard. A major incident is marked as Service outage on the status dashboards. After the issue is resolved, we publish a public incident report that details the factors that contributed to the incident and the steps we plan to take to prevent such incidents from recurring.

In the case of smaller-scoped incidents, a nonpublic report might be made available to customers.

Lifecycle of an incident

When a product degradation is detected, the Cloud Customer Care team and product engineering team work together to resolve the incident and communicate it to you.

Lifecycle Diagram

Detection

Google Cloud uses internal and black box monitoring to detect incidents. For more information, see Chapter 6 of the Site Reliability Engineering book.

If you have Premium, Enhanced, Standard, Role-Based, or Enterprise Support, you can report an incident by creating a support case in the Google Cloud console. If you have Platinum, Gold, or Silver support, you can report an incident by creating a support case in the Google Cloud Support Center. Otherwise, you can use this form.

Initial Response

When an incident is detected, the Customer Care team leads communication with you. Initial notification of an incident is often sparse, frequently only mentioning the product in question. This is because we prioritize fast notification over detail. Detail can be provided in subsequent updates.

To provide you with as much information as possible without overwhelming you with issues that do not affect you, different communication channels are used depending on the scope and severity of an issue:

Comms Diagram

The Google CSH Dashboard is the first place to check when you discover that an issue is affecting you. The dashboard shows incidents that affect many customers, so if you see an incident listed, it is likely related to your problem. To indicate severity, the dashboard marks incidents as either a disruption or an outage. More minor but still widespread issues are posted as temporary notices.

When a relevant Google Cloud product or service reports an issue in the Google CSH Dashboard, you might also see an outage notice in the Cloud console. If an outage notice appears in the Cloud console, you can click the notice to learn more about the status of the issue.

Some Google Cloud products have Google Groups that you can subscribe to in order to receive announcements and notifications about new incidents on the Google CSH Dashboard.

The known issues displayed in the Google Cloud Support Center and on the Cloud console Support page are the most comprehensive view of issues, and include issues that affect fewer people than those shown on the dashboard. If you suspect a Google Cloud issue but do not see anything on the dashboard, check the known issues.

Support cases are appropriate for issues that do not qualify as incidents or where a one-to-one human touch is needed. The known issues page allows you to create a case from a posted incident so that you get regular updates and can talk to support staff.

Investigate

Product engineering teams are responsible for investigating the root cause of incidents. Incident management is often done by Site Reliability Engineers but might be done by software engineers or others, depending on the situation and product. For more information, see Chapter 12 of the Site Reliability Engineering book.

Mitigation/Fix

An issue is considered fixed only when changes have been made that Google is confident will end the impact indefinitely. For example, the fix could be rolling back a change that triggered an incident.

While an incident is in progress, Customer Care and the product team attempt to mitigate the issue. Mitigation is when the impact or scope of an issue can be reduced, for example, by temporarily providing additional resources to a service suffering overload.

If no mitigation has been found, the Customer Care team finds and communicates workarounds when possible. Workarounds are steps that you can take to meet the underlying need despite the incident. For example, a workaround might be to use different settings for an API call to avoid a problematic code path.

Follow Up

While an incident is ongoing, the Customer Care team provides regular updates. Updates typically provide:

  • More information about the incident, such as error messages, zones or regions affected, which features are affected, or percentages of impact.

  • Progress towards mitigation, including any workarounds.

  • Timelines for communication, tailored to the incident.

  • Changes in status, such as when an incident is fixed.

Postmortem

All incidents have an internal postmortem to fully understand the incident and identify reliability improvements that Google can make. These improvements are then tracked and implemented. For more information on postmortems at Google, see Chapter 15 of the Site Reliability Engineering book.

Incident Report

When incidents have very wide and serious impact, Google provides incident reports that outline the symptoms, impact, root cause, remediation, and future prevention of incidents. As with postmortems, we pay particular attention to the steps that we take to learn from the issue and improve reliability. Google's goal in writing and releasing postmortems is to be transparent and demonstrate our commitment to building stable services for our customers.

FAQ

What type of status information can I find on the Google CSH Dashboard?

The Google CSH Dashboard provides status information on services that are part of Google Cloud. Status can include service disruptions, outages, or informational messages about a temporary issue.

Where can I find information about past service disruptions and outages?

The Google CSH Dashboard keeps a record of disruptions and outages for the Google Cloud services for up to five years. The Overview tab of the dashboard shows the current status of the services by locale. To view information about service disruptions and outages in the last year, click View history on the dashboard. To view a service's outage history for the last five years, click See more for that service.

How can I view regionalized status information for Google Cloud services?

The Google CSH Dashboard displays the status of all Google Cloud services organized by region and global locale. To view service status for a multi-region, click the region-specific tab.

Can I build integrations to consume the data displayed on the Google CSH Dashboard programmatically?

Yes, you can consume the data displayed on the Google CSH Dashboard in the following ways:

  • Through an RSS feed
  • Through a JSON History file

    You can download the schema for the JSON file here.

The RSS feed and the JSON History file provide incident status information that integrations can consume.
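
As a rough illustration of such an integration, the sketch below parses the text of a JSON History file and counts incidents per service. It assumes that the file is a JSON array of incident objects and that each object carries a `service_name` key. Both are assumptions made for this example, so check the published schema before relying on them. The sample data is invented.

```python
import json
from collections import Counter


def incidents_per_service(history_text: str) -> Counter:
    """Count incidents by service in the text of a JSON History file."""
    # Assumption: the file is a JSON array with one object per incident.
    incidents = json.loads(history_text)
    # Assumption: `service_name` identifies the service; records that
    # lack it are grouped under "unknown" instead of raising an error.
    return Counter(inc.get("service_name", "unknown") for inc in incidents)


# Invented sample data standing in for a downloaded history file.
sample = '''[
  {"id": "a1", "service_name": "Example Service A"},
  {"id": "b2", "service_name": "Example Service B"},
  {"id": "c3", "service_name": "Example Service A"}
]'''

counts = incidents_per_service(sample)
print(counts.most_common())
```

In a real integration, `history_text` would come from downloading the JSON History file on whatever schedule suits your monitoring.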

What if I have pre-built integrations based on the Google Cloud Status Dashboard that predate the introduction of regionalized status reporting and the name change to Google Cloud Service Health Dashboard?

In both the RSS feed and the JSON file, the regional status information is additive to the information that was already being published before the introduction of regionalized status reporting and the renaming of the Google Cloud Status Dashboard. Therefore, we expect your existing integrations to continue working. However, if you want to consume the regional status information through your integrations, then you need to modify them.

Here's a detailed description of how regional information is presented in both the RSS feed and the JSON file:

  • RSS feed

    The regional status information is a new addition to the feed information that was provided prior to the introduction of regionalized status. Any locations that are reported as affected are appended to the RSS message.

  • JSON file

    Prior to the regional status update, Google Cloud published a stream of incidents where each incident contained a list of affected products and a list of status updates for each, if any. These status updates contained an unstructured string field that might or might not have contained the location information.

    Now, Google Cloud publishes a stream of incidents just as it did before. However, each incident contains the following new fields:

    • updates.affected_locations: contains a structured list of affected locations at the time the update was posted. Every update record and the most_recent_update record contain this field.
    • currently_affected_locations: contains the most recent information on the locations that are actively impacted by the incident. Unlike updates.affected_locations, this list becomes empty after the incident is resolved (that is, when end is set to a non-empty value).
    • previously_affected_locations: contains a list of locations that were impacted earlier in the incident but aren't currently. As the incident progresses, the outage might be resolved in some locations. These locations remain in the previously_affected_locations field. Once the incident is resolved (that is, when end is set to a non-empty value), this field contains a list of all locations that were impacted during this incident.
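
The fields described above might be read as in the following sketch. It assumes that `currently_affected_locations`, `previously_affected_locations`, `updates`, and `end` sit at the top level of each incident object, and that each location entry is an object with an `id` key. The nesting and the location shape are assumptions for illustration, and the sample record is invented, so consult the published schema for the actual layout. Missing fields are treated as empty, which keeps the code usable on records that predate regional status reporting.

```python
import json


def location_ids(locations) -> list:
    """Return the `id` of each location entry; tolerate a missing list."""
    # Assumption: each location is an object with an `id` key.
    return [loc["id"] for loc in (locations or [])]


def summarize_locations(incident: dict) -> dict:
    """Collect the regional fields of one incident record."""
    return {
        # The incident counts as resolved once `end` is non-empty.
        "resolved": bool(incident.get("end")),
        "current": location_ids(incident.get("currently_affected_locations")),
        "previous": location_ids(incident.get("previously_affected_locations")),
        # One list of location ids per status update.
        "per_update": [
            location_ids(update.get("affected_locations"))
            for update in incident.get("updates", [])
        ],
    }


# Invented sample: a resolved incident that affected two locations.
sample = json.loads('''{
  "end": "2024-01-01T12:00:00+00:00",
  "currently_affected_locations": [],
  "previously_affected_locations": [
    {"id": "region-a", "title": "Region A"},
    {"id": "region-b", "title": "Region B"}
  ],
  "updates": [
    {"affected_locations": [{"id": "region-a", "title": "Region A"}]},
    {"affected_locations": [
      {"id": "region-a", "title": "Region A"},
      {"id": "region-b", "title": "Region B"}
    ]}
  ]
}''')

summary = summarize_locations(sample)
print(summary)
```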

What if I am experiencing an issue, but it is not listed on the dashboard?

The issue may be isolated to your projects or instances, or it may be impacting a limited number of customers. You can contact Customer Care about any issues you are experiencing that are not listed on the dashboard.

If you are using the Cloud console, you can use the Send feedback tool in the upper right corner to report problems.

Who updates the dashboard?

The global Customer Care team monitors the status of services using many different types of signals and updates the dashboard in the event of a widespread issue. If needed, they will post a detailed incident analysis report after an incident has been resolved.

What is the difference between an incident and an outage?

Although these terms are often used interchangeably, the Google CSH Dashboard and our external communications use incident to refer to any period of degraded service and outage to refer only to the most serious incidents, where a product is nonfunctioning to a large extent.