Introducing Personalized Service Health: Upleveling incident response communications
Senior Product Manager, Cloud Reliability
Group Product Manager, Cloud Reliability
When an incident disrupts a cloud service that you rely on, an effective response starts with identifying the source of that disruption and evaluating the scope of impact. This is crucial to charting a course of action — whether that’s communicating with your stakeholders or deploying a disaster recovery procedure. But when you use a cloud service provider, your ability to mount an effective incident response is dependent on the transparency, timeliness, and actionability of the incident communications provided.
Today, we’re excited to introduce Personalized Service Health, which provides fast, transparent, relevant, and actionable communication about Google Cloud service disruptions. Personalized Service Health is currently in Preview. You can use it to receive granular alerts about Google Cloud service disruptions, as a stop in your incident response workflow, or as an integration with your incident response or monitoring tools.
Why should I use Personalized Service Health?
Today, when Google detects an incident that could potentially impact you, we publish that information openly with Google Cloud Service Health, our highly reliable public dashboard that delivers information on active incidents that require wide distribution — typically those that tend to be larger in scope or severity. Organized by Google Cloud products and the regions they operate in, Google Cloud Service Health displays real-time information about incidents impacting Google Cloud products and provides mechanisms to download service disruption history.
Personalized Service Health takes these benefits a step further, and is the ideal destination for many customers to start their incident response journey. Personalized Service Health provides:
Controls to decide the service disruptions relevant to you: Google Cloud Service Health posts incidents that affect a broad set of customers, and is not an exhaustive list of incidents. If you prefer to see or be alerted to more incidents, earlier or more often, including smaller-scale ones, you can use Personalized Service Health to configure how and when you are alerted about incidents.
Ability to integrate with your incident management workflow: Personalized Service Health offers multiple integration options with your preferred incident management tools and workflows — for example, you can integrate alerts with PagerDuty to alert the appropriate incident responders when a service disruption begins.
Proactive incident discoverability: Personalized Service Health emits logs and can push customizable alerts to make incidents more discoverable in your workflow.
Let’s take a deeper look at these benefits.
Configure alerts to choose how you discover events
Personalized Service Health can fire an alert to an extensive array of destinations when a Google Cloud service disruption is posted or updated. You can choose which disruptions you want to be alerted on and where the alerts are sent, and you can customize the alert content to include critical information about the incident: the affected Google services and locations, current relevance to your project, observable symptoms, and known mitigations.
You can configure alerts directly in Personalized Service Health, in Cloud Monitoring, or via Terraform. Each alert can be fired to one or more destinations, including email, SMS, Pub/Sub, webhook, or PagerDuty. You can also create multiple alerts for a single project for a higher degree of granularity.
Personalized Service Health is designed to publish information related to disruptions that may affect your projects with various degrees of relevance. By definition, this approach may provide you more information than what you think is strictly necessary. To strike a balance, you can filter the incidents to only see what you may deem relevant, across a variety of integration points:
Dashboard: Filter the incident table by any displayed field and incident recency.
Alerts: You can create a conditional alerting policy with any incident field, including Google Cloud products, locations, or relevance to your project.
API: You can use request filters in your API requests to further filter events programmatically in your application.
Logs: Cloud Logging supports a robust query language to filter logs as they are routed to another destination through a log sink.
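The filtering idea above can be sketched in a few lines of Python. This is a minimal illustration only: the `Incident` record, its field names, and the relevance labels are hypothetical stand-ins, not the actual Personalized Service Health schema, and the sample incidents are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical, simplified incident record (not the real API schema)."""
    title: str
    product: str
    location: str
    relevance: str  # assumed labels, e.g. "IMPACTED", "RELATED", "NOT_IMPACTED"


def filter_incidents(incidents, products=None, locations=None, relevances=None):
    """Keep incidents that match every criterion given; None means 'any'."""
    def matches(value, allowed):
        return allowed is None or value in allowed

    return [
        i for i in incidents
        if matches(i.product, products)
        and matches(i.location, locations)
        and matches(i.relevance, relevances)
    ]


# Made-up sample data for illustration.
incidents = [
    Incident("Elevated latency", "Compute Engine", "us-east1", "IMPACTED"),
    Incident("API errors", "Cloud Storage", "europe-west1", "RELATED"),
    Incident("Slow queries", "BigQuery", "us-east1", "NOT_IMPACTED"),
]

# Only incidents in us-east1 that are impacting or related to the project.
relevant = filter_incidents(
    incidents,
    locations={"us-east1"},
    relevances={"IMPACTED", "RELATED"},
)
print([i.title for i in relevant])
```

The same predicate could sit behind a dashboard view, an alert condition, or a post-processing step on API results; only the source of the incident records changes.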
Integrate with your incident management workflow
Incident response can span many people, teams, and tools in an organization. Personalized Service Health aims to fit into your existing incident response processes by offering several integration options depending on your preference for programmatic access, proactive versus reactive interactions, and existing tools.
You can use Personalized Service Health as a dashboard directly from the Google Cloud console, or fit it into any existing incident response or monitoring tool in your preferred workflow. The Service Health dashboard provides a list of active incidents relevant to your project, and, for each incident, you can see impact details about the incident or track updates from Google Cloud support. This is quick to set up and easy to maintain.
If you’re integrating Personalized Service Health with an external alerting, monitoring, or incident response tool, the Service Health API offers programmatic access to all incidents relevant to a specific project or to all projects across your organization. For each relevant incident, the API also returns updates from Google Cloud and a description of the impact.
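As a rough sketch of what project-level versus organization-level access might look like, the helper below builds a list-events URL for either scope. The base URL, resource paths, query parameter names, and the filter expression are assumptions to be checked against the Service Health API reference. `events_url` is a hypothetical helper, and a real call would also need to be an authenticated request.

```python
from urllib.parse import urlencode

# Assumed base URL for the Service Health API; confirm against the API reference.
BASE_URL = "https://servicehealth.googleapis.com/v1"


def events_url(project_id=None, organization_id=None, event_filter=None, page_size=None):
    """Build a list-events URL for one project or for a whole organization."""
    if (project_id is None) == (organization_id is None):
        raise ValueError("pass exactly one of project_id or organization_id")

    # Assumed resource paths for project and organization scopes.
    if project_id is not None:
        path = f"projects/{project_id}/locations/global/events"
    else:
        path = f"organizations/{organization_id}/locations/global/organizationEvents"

    params = {}
    if event_filter:
        params["filter"] = event_filter  # filter syntax here is illustrative
    if page_size:
        params["pageSize"] = page_size

    query = f"?{urlencode(params)}" if params else ""
    return f"{BASE_URL}/{path}{query}"


# "my-project" is a placeholder project ID.
print(events_url(project_id="my-project", event_filter="state = ACTIVE"))
```

Keeping URL construction separate from the HTTP call makes it easy to plug the result into whichever HTTP client or incident tool you already use.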
Build a history, report, and learn from past disruptions
When a service disruption begins, Cloud Logging collects Personalized Service Health logs for all updates to the event. To build up a historic record of events, you can retain logs in a storage bucket. You can also use Log Analytics with BigQuery to analyze past service disruptions.
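To illustrate the kind of report a retained history makes possible, here is a small sketch that summarizes disruptions per product. The record shape (`incident_id`, `product`, `timestamp`) is hypothetical and does not reflect the actual Personalized Service Health log format, and the sample records are made up. Duration is approximated as the time between the first and last logged update for an incident.

```python
from collections import defaultdict
from datetime import datetime


def summarize(records):
    """Return {product: (incident_count, total_minutes)} from update records."""
    # Track the first and last update time seen for each incident.
    spans = {}
    for r in records:
        key = (r["product"], r["incident_id"])
        ts = datetime.fromisoformat(r["timestamp"])
        first, last = spans.get(key, (ts, ts))
        spans[key] = (min(first, ts), max(last, ts))

    # Roll incidents up into per-product counts and durations.
    summary = defaultdict(lambda: [0, 0.0])
    for (product, _), (first, last) in spans.items():
        summary[product][0] += 1
        summary[product][1] += (last - first).total_seconds() / 60
    return {p: (count, minutes) for p, (count, minutes) in summary.items()}


# Made-up records standing in for exported log entries.
records = [
    {"incident_id": "a1", "product": "Compute Engine", "timestamp": "2023-08-01T10:00:00+00:00"},
    {"incident_id": "a1", "product": "Compute Engine", "timestamp": "2023-08-01T10:45:00+00:00"},
    {"incident_id": "b2", "product": "Cloud Storage", "timestamp": "2023-08-03T09:00:00+00:00"},
    {"incident_id": "b2", "product": "Cloud Storage", "timestamp": "2023-08-03T09:30:00+00:00"},
    {"incident_id": "c3", "product": "Compute Engine", "timestamp": "2023-08-05T12:00:00+00:00"},
]

for product, (count, minutes) in summarize(records).items():
    print(f"{product}: {count} incident(s), {minutes:.0f} min between first and last update")
```

For larger histories, the same aggregation would more naturally be expressed as a query over the retained logs than as in-memory Python.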
Integrate once and enjoy benefits that get even better over time
As of today, we are excited to announce Personalized Service Health is integrated with 50+ Google Cloud products & services – including Compute Engine, Cloud Storage, all Cloud Networking offerings, BigQuery, Google Kubernetes Engine, and many more. If any integrated Google Cloud product detects a disruption that may impact you, Personalized Service Health provides an impact assessment, and shares updates including symptoms, known workarounds, or an ETA for resolution.
Some products may offer more advanced capabilities through Personalized Service Health, including faster initial posting, definitive impact signals, and posting of small blast-radius incidents that do not appear on the public Google Cloud Service Health dashboard. The documentation provides the complete list of integrated products and supported capabilities; we expect the list of supported Google Cloud products and capabilities to expand over time.
From our customers and partners
"The instinct for cloud providers is to be overly cautious about sharing outages too quickly. I’d rather proactively move a workload and learn there was no issue than the workload go down unknowingly. We’re happy to see Google Cloud make this step to be more transparent with customers and look forward to leveraging PSH."
- Justin Watts, Director Information Services & Technology Strategy, Telus
"Proactive alerts from Personalized Service Health to responders are critical to any enterprise customer’s incident response process. The PagerDuty and Google Cloud partnership is able to provide our customers an essential platform for modern operations that helps them quickly respond to cloud disruptions and deliver seamless digital experiences."
- Jonathan Rende, SVP Products, PagerDuty
Get started today
Reliable infrastructure is essential for workloads in the cloud, and we’re continuously raising the bar on reliability through technology, product, and process innovation. A key component of reliability is the speed and effectiveness of incident response. During a cloud service incident, however unlikely, excellent communications are vital. Personalized Service Health provides the information you need to take your incident response communications to the next level, so you can quickly assess what is happening, take actions to minimize impact to your applications, and keep your stakeholders informed. To get started, enable Personalized Service Health for a project or across your organization.