Google Cloud Platform

Fearless shared postmortems — CRE life lessons

We here on Google’s Site Reliability Engineering (SRE) teams have found that writing a blameless postmortem — a recap and analysis of a service outage — makes systems more reliable, and helps service owners learn from the event.

Postmortems are easy to do within your company — but what about sharing them outside your organization? Why indeed, would you do this in the first place? It turns out that if you're a service or platform provider, sharing postmortems with your customers can be good for you and them too.

In this installment of CRE Life Lessons, we discuss the benefits and complications that external postmortems can bring, and some practical lessons about how to craft them.

Well-known external postmortems 

There is prior art, and you should read it. 

Over the years, we’ve had our share of outages and recently, we’ve been sharing more detail about them than we used to. For example, on April 11 2016 Google Compute Engine dropped inbound traffic, resulting in this public incident report.

Other companies are also publishing detailed postmortems about their own outages. Who can forget the time when:

We in Google SRE love reading these postmortems — and not because of schadenfreude. Indeed, many of us read them and think “there but for the grace of (Deity) go we” and wonder whether we would withstand a similar failure. Indeed, when you’re thinking this, it’s a good time to run a DiRT exercise.

For platform providers that offer a wide range of services to a wide range of users, fully public postmortems such as these make sense (even though they're a lot of work to prepare and open you up to criticism from competitors and press). But even if the impact of your outage isn’t as broad, if you are practising SRE, it can still make sense to share postmortems with customers that have been directly impacted. Caring about your customers’ reliability means sharing the details of your outages.

This is the position we take on the Google Cloud Platform (GCP) Customer Reliability Engineering. To help customers run reliably on GCP, we teach them how to engineer increased reliability for their service by implementing SRE best practices in our work together. We identify and quantify architectural and operational risks to each customer’s service, and work with them to mitigate those risks and drive to sustain system reliability at their SLO (Service Level Objectives) target.

Specifically, the CRE team works with each customer to help them meet the availability target expressed by their SLOs. For this, the principal steps are to:

  1. Define a comprehensive set of business-relevant SLOs
  2. Get the customer to measure compliance to those SLOs in their monitoring platform (how much of the service error budget has been consumed)
  3. Share that live SLO information with Google support and product SRE teams (which we term shared monitoring)
  4. Jointly monitor and react to SLO breaches with the customer (shared operational fate)
If you run a platform — or some approximation thereof — then you too should practice SRE with your customers to get that increased reliability, prevent your customers from tripping over your changes, and gain better insights into the impact and scope of your failures.

Then, when an incident occurs that causes the service to exceed its error budget — or consumes an unacceptably high proportion of the error budget — the service owner needs to determine:

  1. How much of the error budget did this consume in total?
  2. Why did the incident happen?
  3. What can / should be done to stop it from happening again? 
Answering Question #1 is easy, but the mechanism for evaluating Questions #2 and #3 is a postmortem. If the incident root cause was purely on the customer’s side, that’s easy — but what if the trigger was an event on your platform side? This is when you should consider an external postmortem.

Foundations of an external postmortem

Analyzing outages — and subsequently writing about them in a postmortem — benefits from having a two-way flow of monitoring data between the platform operator and the service owner, which provides an objective measure of the external impact of the incident: When did it start, how long did it last, how severe was it, and what was the total impact on the customer’s error budget? Here on the GCP CRE team, we have found this particularly useful, since it's hard to estimate the impact of problems in lower-level cloud services on end users. We may have observed a 1% error rate and increased latency internally, but was it noticeable externally after traveling through many layers of the stack?

Based on the monitoring data from the service owner and their own monitoring, the platform team can write their postmortem following the standard practices and our postmortem template. This results in an internally reviewed document that has the canonical view of the incident timeline, the scope and magnitude of impact, and a set of prioritized actions to reduce the probability of occurrence of the situation (increased Mean Time Between Failures), reduce the expected impact, improve detection (reduced Mean Time To Detect) and/or recover from the incident more quickly (reduced Mean Time To Recover).

With a shared postmortem, though, this is not the end: we want to expose some —though likely not all — of the postmortem information to the affected customer.

Selecting an audience for your external postmortem

If your customers have defined SLOs, they (and you) know how badly this affected them. Generally, the greater the error budget that has been consumed by the incident, the more interested they are in the details, and the more important it will be to share with them. They're also more likely to be able to give relevant feedback to the postmortem about the scope, timing and impact of the incident, which might not have been apparent immediately after the event.

If your customer’s SLOs weren’t violated but this problem still affected their customers, that’s an action item for the customer’s own postmortem: what changes need to be made to either the SLO or its measurements? For example, was the availability measurement further down in the stack compared to where the actual problem occurred?

If your customer doesn’t have SLOs that represent the end-user experience, it’s difficult to make an objective call about this. Unless there are obvious reasons why the incident disproportionately affected a particular customer, you should probably default to a more generic incident report.

Another factor you should consider is whether the customers with whom you want to share the information are under NDA; if not, this will inevitably severely limit what you're able to share.

If the outage has impacted most of your customers, then you should consider whether the externalized postmortem might be the basis for writing a public postmortem or incident report, like the examples we quoted above. Of course, these are more labor-intensive than external postmortems shared with select customers (i.e., editing the internal postmortem and obtaining internal approvals), but provide additional benefits.

The greatest gain from a fully public postmortem can be to restore trust from your user base. From the point of view of a single user of your platform, it’s easy to feel that their particular problems don’t matter to you. A public postmortem gives them visibility into what happened to their service, why, and how you're trying to prevent it from happening again. It’s also an opportunity for them to conduct their own mini-postmortem based on the information in the public post, asking themselves “If this happened again, how would I detect it and how could I mitigate the effects on my service?”

Deciding how much to share, and why?

Another question when writing external postmortems is how deep to get into the weeds of the outage. At one end of the spectrum you might share your entire internal postmortem with a minimum of redaction; at the other you might write a short incident summary. This is a tricky issue that we’ve debated internally.

The two factors we believe to be most important in determining whether to expose the full detail of a postmortem to a customer, rather than just a summary, are:

  1. How important are the details to understanding how to defend against a future re-occurrence of the event?
  2. How badly did the event damage their service, i.e., how much error budget did it consume? 
As an example, if the customer can see the detailed timeline of the event from the internal postmortem, they may be able to correlate it with signals from their own monitoring and reduce their time-to-detection for future events. Conversely, if the outage only consumed 8% of their 30-day error budget then all the customer wants to know is whether the event is likely to happen more often than once a month.

We have found that, with a combination of automation and practice, we can produce a shareable version of an internal postmortem with about 10% additional work, plus internal review. The downside is that you have to wait for the postmortem to be complete or nearly complete before you start. By contrast, you can write an incident report with a similar amount of effort as soon as the postmortem author is reasonably confident in the root cause.

What to say in a postmortem 

By the time the postmortem is published, the the incident has been resolved, and the customer really cares about three questions:

  1. Why did this happen? 
  2. Could it have been worse? 
  3. How can we make sure it won’t happen again?

“Why did this happen?” comes from the “Root causes and Trigger” and “What went wrong” sections of our postmortem template. “Could it have been worse?” comes from “Where we got lucky.”
These are two sections which you should do your best to retain as-is in an external postmortem, though you may need to do some rewording for clarity.

“How can we make sure it won’t happen again” will come from the Action items table of the postmortem.

What not to say

With that said, postmortems should never include these three things:

  1. Names of humans - Rather than “John Smith accidentally kicked over a server”, say “a network engineer accidentally kicked over a server,” Internally, we try to express the role of humans in terms of role rather than name. This helps us keep a blameless postmortem culture
  2. Names of internal systems - The names of your internal systems are not clarifying for your users and creates a burden on them to discover how these things fit together. For example, even though we’ve discussed Chubby externally, we still refer to it in postmortems we make external as “our globally distributed lock system.”
  3. Customer-specific information - The internal version of your postmortem will likely say things like “on XX:XX, Acme Corp filed a Support ticket alerting us to a problem.” It’s not your place to share this kind of detail externally as it may create an undue burden for the reporting company (in this case Acme Corp.). Rather, simply say “on XX:XX, a customer filed…”. If you’re going to reference more than one customer, then just label them Customer A, Customer B, etc..

Other things to watch out for

Another source of difficulty when rewriting a shared postmortem is the “Background” section that sets the scene for the incident. An internal postmortem assumes the reader has basic knowledge of the technical and operational background; this is unlikely to be true for your customer. We try to write the least detailed explanation that still allows the reader to understand why the incident happened; too much detail here is more likely to be off-putting than helpful.

Google SREs are fans of embedding monitoring graphs in postmortems; monitoring data is objective and doesn’t generally lie to you (although our colleague Sebastian Kirsch has some very useful guidance as to when this is not true). When you share a postmortem outside the company, however, be careful what information these graphs reveal about traffic levels and number of users of a service. Our rule of thumb is to leave the X axis (time) alone, but for the Y axis either remove the labels and quantities all together, or only show percentages. This is equally true for incorporating customer-generated data in an internal postmortem.

A side note on the role of luck

With apologies to Tina Turner, What’s luck got to do, got to do with it? What’s luck but a source of future failures?

As well as “What went well” and “What went badly” our internal postmortem template includes the section “Where we got lucky.” This is a useful place to tease out risks of future failures that were revealed by an incident. In many cases an incident had less impact than it might have, because of relatively random factors such as timing, presence of a particular person as the on-call, or co-incidence with another outage that resulted in more active scrutiny of the production systems than normal.

“Where we got lucky” is an opportunity to identify additional action items for the postmortem, e.g.,

  • “the right person was on-call” implies tribal knowledge that needs to be fed into a playbook and exercised in a DiRT test
  • “this other thing (e.g., a batch process or user action) wasn’t happening at the same time” implies that your system may not have sufficient surplus capacity to handle a peak load, and you should consider adding resources
  • “the incident happened during business hours” implies a need for automated alerting and 24-hour pager coverage by an on-call
  • “we were already watching monitoring” implies a need to tune alerting rules to pick up the leading edge of a similar incident if it isn’t being actively inspected.
Sometimes teams also add “Where we got unlucky,” when the incident impact was aggravated by a set of circumstances that are unlikely to re-occur. Some examples of unlucky behavior are:

  • an outage occurred on your busiest day of the year
  • you had a fix for the problem that hadn't been rolled out for other reasons
  • a weather event caused a power loss.

A major risk in having a “Where we got unlucky” category is that it’s used to label problems that aren’t actually due to blind misfortune. Consider this example from an internal postmortem:

Where we got unlucky 

There were various production inconsistencies caused by past outages and experiments. These weren’t cleaned up properly and made it difficult to reason about the state of production.

This should instead be in “What went badly,” because there are clear action items that could remediate this for the future.

When you have these unlucky situations, you should always document them as part of "What went badly," while assessing the likelihood of them happening again and determining what actions you should take. You may choose not to mitigate every risk since you don’t have infinite engineering time, but you should always enumerate and quantify all the risks you can see so that “future you” can revisit your decision as circumstances change.

Summary

Hopefully we've provided a clear motivation for platform and service providers to share their internal postmortems outside the company, at some appropriate level of detail. In the next installment, we'll discuss how to get the greatest benefit out of these postmortems.