Lifecycle of an incident
When a product degradation is detected, the product engineering team and Google Cloud Platform Support team work together to resolve the incident and communicate it to you.
Google uses internal and black box monitoring to detect incidents. For more information, see Chapter 6 of the Site Reliability Engineering book.
When an incident is detected, the Support team leads communication with you. Initial notification of an incident is often sparse, frequently only mentioning the product in question. This is because we prioritize fast notification over detail. Detail can be provided in subsequent updates.
To provide you with as much information as possible without overwhelming you with issues that do not affect you, different communication channels are used depending on the scope and severity of an issue:
The Cloud Status Dashboard is the first place to check when you discover an issue is affecting you. The dashboard shows incidents that affect many customers, so if you see an incident listed, it is likely related to your problem. To indicate severity, the status dashboard marks incidents as either a disruption or an outage. More minor but still widespread issues are posted as temporary notices.
The known issues displayed in the Google Cloud Support Center and on the Cloud Console Support page provide the most comprehensive view of issues, and include issues that affect fewer people than those shown on the dashboard. If you suspect a GCP issue but do not see anything on the dashboard, check here.
Support cases are appropriate for issues that do not qualify as incidents or where a one-to-one human touch is needed. The known issues page allows you to create a case from a posted incident so that you get regular updates and can talk to support staff.
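The order in which to consult these channels can be sketched as a small decision function. This is only an illustration in Python: the `Channel` enum, the `next_channel` function, and its parameters are hypothetical names chosen for this sketch and are not part of any Google Cloud API.

```python
from enum import Enum


class Channel(Enum):
    """Hypothetical labels for the three channels described above."""
    STATUS_DASHBOARD = "Cloud Status Dashboard"
    KNOWN_ISSUES = "Known issues page"
    SUPPORT_CASE = "Support case"


def next_channel(
    listed_on_dashboard: bool,
    listed_in_known_issues: bool,
    needs_one_to_one: bool = False,
) -> Channel:
    """Suggest which channel to follow for an issue you are experiencing.

    Assumption: you have already looked at the dashboard and the known
    issues page and know whether your issue appears on each of them.
    """
    # A support case fits whenever one-to-one contact with staff is needed.
    if needs_one_to_one:
        return Channel.SUPPORT_CASE
    # Widespread incidents appear on the dashboard first.
    if listed_on_dashboard:
        return Channel.STATUS_DASHBOARD
    # Issues affecting fewer customers appear only in known issues.
    if listed_in_known_issues:
        return Channel.KNOWN_ISSUES
    # Anything not posted as an incident is handled through a case.
    return Channel.SUPPORT_CASE
```

For example, `next_channel(False, True)` returns `Channel.KNOWN_ISSUES`, which matches the advice to check known issues when nothing appears on the dashboard.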
Product engineering teams are responsible for investigating the root cause of incidents. Incident management is often done by Site Reliability Engineers but might be done by software engineers or others, depending on the situation and product. For more information, see Chapter 12 of the Site Reliability Engineering Book.
An issue is considered fixed only when changes have been made that Google is confident will end the impact indefinitely. For example, the fix could be rolling back a change that triggered an incident.
While an incident is in progress, Support and the Product team will attempt to mitigate the issue. Mitigation is when the impact or scope of an issue can be reduced, for example by temporarily providing additional resources to a service suffering overload.
If no mitigation has been found, when possible, the Support team will find and communicate workarounds. Workarounds are steps that you can take to solve the underlying need despite the incident. A workaround might be to use different settings for an API call to avoid a problematic code path.
While an incident is ongoing, the Support team provides regular updates. Updates typically provide:
More information about the incident, such as error messages, zones or regions affected, which features are affected, or percentages of impact.
Progress towards mitigation, including any workarounds.
Timelines for communication, tailored to the incident.
Changes in status, such as when an incident is fixed.
All incidents have a postmortem internally to fully understand the incident and identify reliability improvements that Google can make. These improvements are then tracked and implemented. For more information on postmortems at Google, see Chapter 15 of the Site Reliability Engineering Book.
When incidents have very wide and serious impact, Google provides incident reports that outline the symptoms, impact, root cause, remediation, and future prevention of incidents. As with postmortems, we pay particular attention to the steps that we take to learn from the issue and improve reliability. Google's goal in writing and releasing postmortems is to be transparent and demonstrate our commitment to building stable services for our customers.
What type of status information can I find on the dashboard home page?
The Google Cloud Status Dashboard provides status information on services that are part of Google Cloud Platform. Status can include service disruptions, outages, or informational messages about a temporary issue.
Where can I find information about past service disruptions and outages?
The Summary and History page is a repository of disruptions and outages from the past 365 days. Click an incident number to review the posts about the incident while it was ongoing, as well as any incident summary reports written by the Support team.
What if I am experiencing an issue, but it is not listed on the dashboard?
The issue may be isolated to your projects or instances, or it may be impacting a limited number of customers. You can contact Support about any issues you are experiencing that are not listed on the dashboard.
If you are using Google Cloud Platform Console, you can click the Send feedback tool in the upper right corner to report problems.
Who updates the dashboard?
The global Google Cloud Platform Support team monitors the status of services using many different types of signals and updates the dashboard in the event of a widespread issue. If needed, they will post a detailed incident analysis report after an incident has been resolved.
What is the difference between an "incident" and an "outage"?
Although these terms are often used interchangeably, Cloud Status Dashboard and our external communications use "incident" to refer to any period of degraded service and "outage" to refer only to the most serious incidents, where a product is largely nonfunctional.