Google Cloud incident communication

This document outlines Google Cloud's approach to communicating incidents, including the roles of its two primary communication channels: Personalized Service Health (PSH) and the public Google Cloud Service Health (CSH) dashboard.

Where to look for incident communications

Google Cloud provides two channels for incident communication, each with a different purpose:

  • Personalized Service Health (PSH): This is your primary source for service disruptions relevant to you. It provides a personalized view tailored to the specific Google Cloud products you use. We recommend integrating Personalized Service Health into your incident response process as a critical corroboration signal.

    To avoid paging an on-call engineer for every Personalized Service Health event, integrate Personalized Service Health incident visibility into your team's dashboards and tools. This practice helps operators quickly determine if a suspected issue is related to a Google Cloud service disruption. Learn more about PSH.

  • Google Cloud Service Health (CSH): This is Google Cloud's public-facing status page, available at status.cloud.google.com. It requires no login and serves as an at-a-glance health check for the entire platform. It is used to communicate broad severe incidents and to provide updates when PSH itself is unavailable.
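As an illustration of using Personalized Service Health as a corroboration signal, the following minimal sketch checks whether any known event overlaps an internal alert in product, location, and time. The `HealthEvent` class, its fields, and the 30-minute window are hypothetical simplifications chosen for this example; they are not the Service Health API's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class HealthEvent:
    """Hypothetical, simplified view of a Personalized Service Health event."""
    product: str
    location: str
    start: datetime
    end: Optional[datetime] = None  # None means the event is still open


def corroborating_events(
    events: List[HealthEvent],
    product: str,
    location: str,
    alert_time: datetime,
    window: timedelta = timedelta(minutes=30),  # assumed tolerance, tune as needed
) -> List[HealthEvent]:
    """Return events for the same product and location active near alert_time."""
    matches = []
    for ev in events:
        if ev.product != product or ev.location != location:
            continue
        # The event must have started before the end of the tolerance window...
        started_in_time = ev.start <= alert_time + window
        # ...and must be open, or have ended no earlier than the window start.
        still_active = ev.end is None or ev.end >= alert_time - window
        if started_in_time and still_active:
            matches.append(ev)
    return matches
```

A dashboard panel could call `corroborating_events` with the product and location of a firing alert and show any matches next to it, so the operator sees the possible Google Cloud correlation without being paged for it.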

Figure: A diagram comparing the PSH and CSH communication channels. Emerging incidents, confirmed incidents, and broad severe incidents go to Personalized Service Health, which offers a Dashboard, an API, and Cloud Logging. Broad severe incidents go to Cloud Service Health, which offers a Dashboard and an RSS Feed.

Our disclosure strategy and recommendations

Deciding what to share, and where, is not an arbitrary process. It is a formal, systematic discipline based primarily on an incident's scope. The intent of these channels is to provide maximum visibility for Broad Severe Incidents while reducing noise from incidents that are not relevant to you.

  • For Broad Severe Incidents: Broad-scoped incidents (those impacting a large percentage of projects or spanning multiple regions) are communicated using Google Cloud Service Health (CSH). These incidents are also communicated to affected customers using Personalized Service Health. Using both channels helps ensure the message reaches the widest possible audience for the most critical events.

  • For other Confirmed Incidents: For issues with a more limited scope, such as those impacting a single location, zone, or a smaller subset of projects, we will communicate relevant incidents to customers using Personalized Service Health. We aim to be comprehensively transparent, which means Service Health makes available all potentially relevant events for your services. If you need a more focused event feed, Service Health offers tools to filter and fine-tune the events that are passed through your alerts and automated workflows.

    • Recommendation: Configure alerts to focus only on your most critical Google Cloud services and locations, or to trigger only on events with a relevance of "Related" or "Impacted." See examples of how to filter and fine-tune alerts.
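The filtering recommendation above can be sketched as a small predicate applied to each event before it reaches an alerting workflow. This is a minimal sketch, not the product's own filter syntax: the event is modeled as a plain dictionary, and the keys `relevance`, `products`, and `locations` are assumptions made for this example.

```python
from typing import AbstractSet, Mapping, Optional

# Relevance values named in the recommendation, compared case-insensitively.
ALERT_RELEVANCES = frozenset({"related", "impacted"})


def should_alert(
    event: Mapping,
    products: Optional[AbstractSet[str]] = None,
    locations: Optional[AbstractSet[str]] = None,
    relevances: Optional[AbstractSet[str]] = ALERT_RELEVANCES,
) -> bool:
    """Return True if the event passes every filter that is enabled.

    Passing None for a filter disables it, so each filter can be used
    on its own or combined with the others.
    """
    if relevances is not None:
        if str(event.get("relevance", "")).lower() not in relevances:
            return False
    if products is not None:
        if not set(products).intersection(event.get("products", ())):
            return False
    if locations is not None:
        if not set(locations).intersection(event.get("locations", ())):
            return False
    return True
```

For example, `should_alert(event, products={"Cloud SQL"}, locations={"us-central1"})` passes only events that touch that product in that location and that carry a relevance of "Related" or "Impacted".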

Fallback strategies for Personalized Service Health unavailability

Personalized Service Health depends on core services, such as Identity and Access Management for authentication. In a severe, widespread disruption, the very services you need to sign in might be affected.

We recommend the following fallback strategy:

  • For manual processes: Your runbooks should direct operators to the Google Cloud Service Health dashboard at status.cloud.google.com if they can't access the Personalized Service Health dashboard.

  • For automated systems: Use the Service Health Status API to programmatically detect whether Personalized Service Health is unavailable or degraded. If it is, your systems can fall back to ingesting the public CSH RSS Feed for continued programmatic updates.
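The automated fallback can be sketched as follows. This is a minimal illustration under stated assumptions: `psh_is_healthy`, `fetch_psh_events`, and `fetch_csh_feed` are hypothetical callables that you supply (for example, thin wrappers around your own HTTP client), and the feed is assumed to be an RSS or Atom document whose items carry a `title` element. No real endpoint or client library is modeled here.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Tuple


def _local(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://...}entry' -> 'entry'."""
    return tag.rsplit("}", 1)[-1]


def feed_titles(feed_xml: str) -> List[str]:
    """Extract item titles from an RSS ('item') or Atom ('entry') document."""
    root = ET.fromstring(feed_xml)
    titles = []
    for node in root.iter():
        if _local(node.tag) in ("item", "entry"):
            for child in node:
                if _local(child.tag) == "title":
                    titles.append((child.text or "").strip())
                    break
    return titles


def incident_updates(
    psh_is_healthy: Callable[[], bool],
    fetch_psh_events: Callable[[], list],
    fetch_csh_feed: Callable[[], str],
) -> Tuple[str, list]:
    """Return (source, updates), preferring PSH and falling back to CSH."""
    try:
        if psh_is_healthy():
            return "PSH", fetch_psh_events()
    except Exception:
        # Any failure to reach PSH (authentication, network, timeout) is
        # treated as PSH being unavailable, which triggers the fallback.
        pass
    return "CSH", feed_titles(fetch_csh_feed())
```

Passing the three operations in as callables keeps the fallback decision separate from transport details, which also makes the logic easy to exercise in a runbook drill without touching live services.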