To stay informed about the health and status of your Google Cloud products, Google Cloud Service Health provides you with information on ongoing widespread incidents that meet certain criteria. This information can include product disruptions, outages, or informational messages about a temporary issue.
Google Cloud Service Health is designed to be available in the rare event that Personalized Service Health itself is unavailable or affected by a disruption, or when the impacted product has not yet been onboarded to Personalized Service Health.
Personalized Service Health provides a personalized view of supported Google Cloud products and locations across your organization. Use Personalized Service Health as the first stop when facing a service disruption and check for communications about active and past Google Cloud incidents that might impact your projects. Personalized Service Health will always have the most information available to Google Cloud customers. You can access Personalized Service Health through the Google Cloud console, by configuring alerts, and through the Personalized Service Health API.
This document focuses on Google Cloud Service Health.
Access Google Cloud Service Health
You can access Google Cloud Service Health through the following:
- A public status dashboard: Google Cloud Service Health
- A public RSS feed
- The Google Cloud console:
  1. In the Google Cloud console, go to the Support > Cases page.
  2. Using the resource selector on the console toolbar, select the resource for which you'd like to list known issues.
  3. Click the Known issues tab.
Known issues also include minor and limited-scope incidents. You can link a support case to a known issue so that you get regular updates and can communicate with support staff. Support cases are appropriate for issues that don't qualify as incidents or where direct interaction is needed. If you have Premium, Enhanced, or Standard Support, you can report an incident by creating a support case.
If you are unable to access Google Cloud Service Health through the previous resources, you can use the Google Cloud Platform Support Questions form.
Supported Google Cloud Service Health incidents
For most Google Cloud incidents, impacted customers receive incident communications directly through Personalized Service Health in the Google Cloud console. If they meet the alert conditions, these incidents also trigger any Service Health alerts that you have configured.
Incidents that meet any of the following criteria appear in Google Cloud Service Health:
- Major, public incidents
- Incidents for Google Cloud products that are not yet supported by Personalized Service Health
- Incidents that occur when the Personalized Service Health dashboard is unavailable
Major incident
Google Cloud defines an incident as a major incident if it meets all of the following conditions:
- High scope: the incident has global impact or is affecting a significant percentage of customer projects across one or more regions.
- High severity: one or more products are unavailable or severely degraded.
In the rare instance a major incident occurs, we act with urgency to resolve any issues.
During a major incident, the status of the issue is communicated through the Google Cloud Service Health dashboard. A major incident is marked as Service outage on the dashboard. After the issue is resolved, we publish a public incident report that includes the details of the factors that contributed to the incident and the steps we plan to take to prevent such incidents from reoccurring.
In the case of smaller-scoped incidents, a nonpublic report might be made available to customers.
Lifecycle of an incident
When a product degradation is detected, the Google Cloud Support team and product engineering team work together to resolve the incident and provide you with updates.
The following diagram shows the responsibilities of the product engineering and support teams:
You can read more about each of these responsibilities in the following sections.
Detection
Google Cloud uses internal and synthetic monitoring to detect incidents. For more information, see Chapter 6 of the Site Reliability Engineering book.
Initial response
When an incident is detected, the Google Cloud Customer Care team manages customer communications. Initial notification of an incident is often sparse, frequently only mentioning the product in question. This is because we prioritize fast notification over detail. Detail can be provided in subsequent updates.
To provide you with as much information as possible without overwhelming you with issues that don't affect you, different communication channels are used depending on the scope and severity of an issue.
Investigate
Product engineering teams are responsible for investigating the root cause of incidents. Incident management is often done by Site Reliability Engineers but might be done by software engineers or others, depending on the situation and product. For more information, see Chapter 12 of the Site Reliability Engineering Book.
Mitigation and fix
An issue is considered fixed only when changes have been made that Google is confident will end the impact indefinitely. For example, the fix could be rolling back a change that triggered an incident.
While an incident is in progress, Customer Care and the product team attempt to mitigate the issue. Mitigation is when the impact or scope of an issue can be reduced, for example, by temporarily providing additional resources to a product suffering overload.
If no mitigation has been found, when possible, the Customer Care team finds and communicates workarounds. Workarounds are steps that you can take to solve the underlying need despite the incident. A workaround might be to use different settings for an API call to avoid a problematic code path.
Follow up
While an incident is ongoing, the Customer Care team provides regular updates. Updates typically provide:
- More information about the incident, such as error messages, zones or regions affected, which features are affected, or percentages of impact.
- Progress towards mitigation, including any workarounds.
- Timelines for communication, tailored to the incident.
- Changes in status, such as when an incident is fixed.
Retrospective
All incidents undergo an internal retrospective to fully understand the incident and identify reliability improvements that Google can make. These improvements are then tracked and implemented. For more information, see Chapter 15 of the Site Reliability Engineering Book.
Incident report
When incidents have very wide and serious impact, Google provides incident reports that outline the symptoms, impact, root cause, remediation, and future prevention of incidents. As with retrospectives, we pay particular attention to the steps that we take to learn from the issue and improve reliability. Google's goal in writing and releasing retrospectives is to be transparent and demonstrate our commitment to building stable products for our customers.
Incident data model
An incident can impact one or more products in one or more locations. Each incident has a start time, an end time, and an overall severity. An incident also has updates that describe how it changes over time, including its status and the locations impacted at the time of each update. The incident information is made available as a JSON file that follows a published schema.
The JSON schema has fields marked Stable and Unstable. In general, ID fields are considered Stable whereas fields such as display names are considered Unstable and might change without warning. Use Stable fields only when integrating with an external system or building automation. For more information, in this document, see Can I build integrations to consume Google Cloud Service Health data programmatically?.
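As an illustration of the data model described above, the following is a minimal Python sketch, not the official schema. The keys affected_products, id, updates, affected_locations, and end are named in this document. The keys begin, severity, and status are assumptions made for the example, as are the class and function names. Check the schema that you download from the public status dashboard before relying on any key name.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class IncidentUpdate:
    status: Optional[str]  # assumed key name: "status"
    # Locations affected at the time this update was posted.
    affected_locations: list[Any] = field(default_factory=list)


@dataclass
class Incident:
    begin: Optional[str]     # assumed key name: "begin"
    end: Optional[str]       # named in this document; empty while ongoing
    severity: Optional[str]  # assumed key name: "severity"
    product_ids: list[str] = field(default_factory=list)
    updates: list[IncidentUpdate] = field(default_factory=list)

    @property
    def is_resolved(self) -> bool:
        # This document treats a non-empty `end` as "resolved".
        return bool(self.end)


def parse_incident(raw: dict[str, Any]) -> Incident:
    """Build an Incident from one decoded JSON record, tolerating missing keys."""
    return Incident(
        begin=raw.get("begin"),
        end=raw.get("end"),
        severity=raw.get("severity"),
        # Keep only the product IDs; display names are ignored on purpose.
        product_ids=[p["id"] for p in raw.get("affected_products", []) if "id" in p],
        updates=[
            IncidentUpdate(
                status=u.get("status"),
                affected_locations=list(u.get("affected_locations", [])),
            )
            for u in raw.get("updates", [])
        ],
    )
```

The sketch deliberately stores only product IDs and drops display names, so that downstream automation cannot accidentally depend on a field that might change without warning.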
FAQ
The following frequently asked questions might assist you when monitoring the health and status of your Google Cloud products.
Where can I find information about past product disruptions and outages?
Google Cloud Service Health keeps a record of disruptions and outages for Google Cloud products for up to five years. The dashboard shows the current status of products by locale. To view information about product disruptions and outages in the last year, click View incident history. To view a product's outage history for the last five years, click See more for that product.
How can I view regionalized status information for Google Cloud products?
Google Cloud Service Health displays the status of all Google Cloud products organized by region and global locale. To view the status for a multi-region, select the region-specific tab.
Can I build integrations to consume Google Cloud Service Health data programmatically?
Yes, you can consume the data displayed by Google Cloud Service Health in the following ways:
- Through an RSS feed
- Through a JSON history file
You can download the schema for the JSON file from the public status dashboard.
The RSS feed and JSON history file provide incident status information which can be consumed through integrations.
Use the fields marked Stable in the JSON history file, instead of the fields marked Unstable. For example, if you're trying to programmatically identify incidents impacting a particular set of products, use the product IDs (affected_products>id), not their display names.
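The guidance above can be sketched in Python as follows. This is a hedged example that assumes the JSON history file decodes to a list of incident objects. The function name, the top-level id key on each incident, the title key, and all sample values are made up for illustration. Only affected_products and its id field come from this document.

```python
import json
from typing import Any, Iterable


def incidents_for_products(
    incidents: Iterable[dict[str, Any]], product_ids: set[str]
) -> list[dict[str, Any]]:
    """Return the incidents whose affected_products include any watched ID."""
    matched = []
    for incident in incidents:
        # Match on the stable product ID, never on the display name.
        ids = {p.get("id") for p in incident.get("affected_products", [])}
        if ids & product_ids:
            matched.append(incident)
    return matched


# Illustrative data only; these IDs and titles are not real product values.
SAMPLE = json.loads("""
[
  {"id": "inc-1", "affected_products": [{"id": "prod-aaa", "title": "Example Compute"}]},
  {"id": "inc-2", "affected_products": [{"id": "prod-bbb", "title": "Example Storage"}]},
  {"id": "inc-3", "affected_products": []}
]
""")

watched = {"prod-aaa"}
hits = incidents_for_products(SAMPLE, watched)
print([i["id"] for i in hits])  # prints ['inc-1']
```

Because the match uses IDs only, renaming "Example Compute" in the sample data would not change the result.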
Product IDs versus product names
Historically, Google Cloud Service Health didn't provide a mechanism for locating the ID for a given product. Since early 2023, Google Cloud Service Health has made available a product catalog that provides this mapping for all products. A product ID gives you a stable field to key off, even if the display name of a product changes. Reference the product ID when programmatically identifying incidents impacting a set of products.
What if I have integrations based on prior Google Cloud Service Health implementations?
In both the RSS feed and the JSON file, the regional status information is an addition to the information that was already being published before the introduction of regionalized status reporting and the change in the name of Google Cloud Service Health. Therefore, we expect your existing integrations to continue working. However, if you want to consume the regional status information through your integrations, then you need to modify them.
Here's a detailed description of how regional information is presented in both the RSS feed and JSON file:
RSS feed
The regional status information is a new addition to the feed information that was provided prior to the introduction of regionalized status. Any locations that are reported as affected are appended to the RSS message.
JSON file
Prior to the regional status update, Google Cloud published a stream of incidents where each incident contained a list of affected products and a list of status updates for each, if any. These status updates contained an unstructured string field that might or might not contain the location information.
Now, Google Cloud publishes a stream of incidents just as it did before. However, for every incident, each status update contains the following new fields:
- updates.affected_locations: contains a structured list of affected locations at the time the update was posted. Every update record and the most_recent_update record contain this field.
- currently_affected_locations: contains the most recent information on the locations that are actively impacted by the incident. Unlike updates.affected_locations, this list becomes empty after the incident is resolved (that is, when end is set to a non-empty value).
- previously_affected_locations: contains a list of locations that were previously impacted during an incident, but aren't currently. As the incident progresses, some locations might have an outage resolution. These locations will still exist in the previously_affected_locations field. Once the incident is resolved (that is, when end is set to a non-empty value), this field contains a list of all locations that were impacted during this incident.
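One way to read these fields together is sketched below in Python. The field names currently_affected_locations, previously_affected_locations, and end come from this document. The assumption that each location entry is an object with an id key, the function names, and the sample region names are hypothetical, so verify them against the downloaded schema.

```python
from typing import Any


def location_ids(entries: list[dict[str, Any]]) -> list[str]:
    # Assumption: each location entry is an object that carries an "id" key.
    return [e["id"] for e in entries if "id" in e]


def summarize_locations(incident: dict[str, Any]) -> dict[str, Any]:
    """Summarize the active and recovered locations for one incident record."""
    # A non-empty `end` means the incident is resolved.
    resolved = bool(incident.get("end"))
    current = location_ids(incident.get("currently_affected_locations", []))
    previous = location_ids(incident.get("previously_affected_locations", []))
    return {
        "resolved": resolved,
        "active": current,      # expected to be empty once resolved
        "recovered": previous,  # all impacted locations once resolved
        "ever_affected": sorted(set(current) | set(previous)),
    }


# Illustrative record for an incident that is still in progress.
ongoing = {
    "end": "",
    "currently_affected_locations": [{"id": "us-east1"}],
    "previously_affected_locations": [{"id": "us-west1"}],
}
summary = summarize_locations(ongoing)
```

Taking the union of both lists gives the full set of impacted locations whether or not the incident has ended, which avoids special-casing the resolved state.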
What if I am experiencing an issue, but it is not listed by Google Cloud Service Health?
Google Cloud Service Health provides current and historical status information for any major incident that affects Google Cloud products and services. If you are experiencing an issue that is not listed by Google Cloud Service Health, the issue might be isolated to your projects or instances, or it might impact a limited number of customers. Incidents with a smaller scope can be listed on the Support Portal. You can contact Customer Care about any issues you are experiencing that are not listed by Google Cloud Service Health.
If you are already using Personalized Service Health, check if the issue is listed there to determine if your project or instance is affected.
If you are using the Google Cloud console, in the top toolbar, select > Send feedback.
Who updates Google Cloud Service Health?
The global Customer Care team monitors the status of products using many different types of signals and updates Google Cloud Service Health in the event of a widespread issue. If needed, they will post a detailed incident analysis report after an incident has been resolved.
What's next
- Create and manage support cases
- Language support and working hours
- Best practices for working with Customer Care
- Best practices for working with Premium Support
- Privacy best practices