Google Cloud Platform
How SREs find the landmines in a service - CRE life lessons
In Part 1 of this blog post we looked at why an SRE team would or wouldn’t choose to onboard a new application. In this installment, we assume that the service would benefit substantially from SRE support, and look at what needs to be done for SREs to onboard it with confidence.
Onboarding reviewQ: We have a new application that would make sense for SRE to support. Do I just throw it over the wall and tell the SRE team “Here you are; you’re on call for this now, best of luck”?
That’s a great approach — if your goal is failure. At first, your developer team’s assessment of the application’s importance for their support — and whether it’s in a supportable state — is likely to be rather different from your SRE team’s assessment, and arbitrarily imposing support for a service onto an SRE team is unlikely to work. Think about it — you haven’t convinced them that the service is a good use of their time yet, and human nature is that people don’t enthusiastically embrace doing something that they don’t really believe in, so they're unlikely to be active participants in making the service materially more reliable.
At Google, we’ve found that to successfully onboard a service into SRE, the service owner and SRE team must agree to a process for the SRE team to understand and assess the service, and identify critical issues to be resolved upfront (Incidentally, we follow a similar process when deciding whether or not to onboard a Google Cloud customer’s application into our Customer Reliability Engineering program). We typically split this into two phases:
- SRE entrance review: where an SRE team assess whether a developer-supported service should be onboarded by SRE, and what the onboarding preconditions should be.
- SRE onboarding/takeover: where a dev and SRE team agree in principle that the SRE team should take on primary operational responsibility for a service, and start negotiating the exact conditions for takeover (how and when the SREs will onboard the service).
- Developers want someone else to pick up support for the service, and make it run as well as possible. They want users to feel that the service is working properly, otherwise they'll move to a service run by someone else.
- The SRE team wants to be sure that they're not being “sold a pup” with a hard-to-support service, and have a vision for making the production service lower in toil and more robust.
- Meanwhile the company management wants to reduce the number of embarrassing service outages, as long as it doesn’t cost them too much in engineer time.
The SRE entrance reviewDuring an SRE entrance review (SER), also referred to as a Production Readiness Review (PRR), the SRE team takes the measure of a service currently running in production. The purpose of an SER is to:
- Assess how the service would benefit from SRE ownership
- Identify service design, implementation and operational deficiencies that could be a barrier to SRE takeover
- And if SRE ownership is determined to be beneficial, identify the bug fixes, process changes and necessary service behavior needed before onboarding the service
The SRE looks at the service as-is: its performance, monitoring, associated operational processes and recent outage history, and asks themselves: “If I were on-call for this service right now, what are the problems I’d want to fix?” They might be visible problems, such as too many pages happening per day, or potential problems such as a dependency on a single machine that will inevitably fail some day.
A critical part of any SRE analysis is the service’s Service Level Objectives (SLOs), and associated Service Level Indicators (SLIs). SREs assume that if a service is meeting its SLOs then paging alerts should be rare or non-existent; conversely, if the service is in danger of falling out of SLO then paging alerts are loud and actionable. If these expectations don’t match reality, the SRE team will focus on changing either the SLO definitions or the SLO measurements.
In the review phase, SREs aim to understand:
- what the service does
- day-to-day service operation (traffic variation, releases, experiment management, config pushes)
- how the service tends to break and how this manifests in alerts
- rough edges in monitoring and alerting
- where the service configuration diverges from the SRE team’s practices
- major operational risks for the service
The SRE team also considers:
- whether the service follows SRE team best practices, and if not, how to retrofit it
- how to integrate the service with the SRE team’s existing tools and processes
- the desired engagement model and separation of responsibilities between the SRE team and the SWE team. When debugging a critical production problem, at what point should the SRE on-call page the developer on-call?
The SRE takeoverThe SRE entrance review typically produces a prioritized list of issues with the service that need to be fixed. Most will be assigned to the development team, but the SRE team may be better suited for others. In addition, not all issues are blockers to SRE takeover (there might be design or architectural changes that SREs recommend for service robustness that could take many months to implement).
There are four main axes of improvement for a service in an onboarding process: extant bugs, reliability, automation and monitoring/alerting. On each axis there will be issues which will have to be solved before takeover (“blockers”), and others which would be beneficial to solve but not critical.
The primary source of issues blocking SRE takeover tends to be action items from the service’s previous postmortems. The SRE team expects to read recent postmortems and verify that a) the proposed actions to resolve the outage root causes are what they’d expect and b) those actions are actually complete. Further, the absence of recent postmortems is a red flag for many SRE teams.
Some reliability-related change requests might not directly block SRE takeover, as many reliability improvements relate to design, significant code changes, a change in back-end integrations or migration off a deprecated infrastructure component, and are targeting the longer-term evolution of the system towards a desired reliability increase.
The reliability-related changes that block takeover would be those which mitigate or remove issues which are known to cause significant downtime, or mitigate risks which are expected to cause an outage in the future.
This is a key concern for SREs considering take over of a service: how much manual work needs to be done to “operate” the service on a week-to-week basis, including configuration pushes, binary releases and similar time-sinks.
In order to find out what would be most useful to automate, the best way is for the SRE to get practical experience of the developer’s world. This means that the SREs should shadow the developer team’s typical week and get a feel for what routine manual work is actually involved for their on-call.
If there’s excessive manual work involved in supporting a service, automation usually solves the problem.
The dominant concern with most services undergoing SRE takeover is the paging rate — how many times the service wakes up the on-call staff. At Google, we adhere to the ”Treynor Maximum” of an average of two incidents per 12 hour shift (for an on-call team as a whole). Thus, an SRE team looks at the average incident load of a new service over the past month or so to see how it fits with their current incident load.
Generally, excessive paging rates are the result of one of three things:
- Paging on something that’s not intrinsically important e.g., task restart or hitting 80% capacity of disk. Instead, downgrade the page to a bug (if it’s not urgent) or eliminate it entirely. Moving to symptom-based monitoring (“users are actually seeing problems”) can help improve this situation.
- Page storms where one small incident/outage generates many pages. Try to group related pages for an incident into a single outage, to get a clearer picture of the system’s outage metrics.
- A system that’s having too many genuine problems. In this case SRE takeover in the near future is unlikely, but SREs may be able to help diagnose and resolve the root causes of the problems.
More general ways to improve the service might include:
- integrating the service with standard SRE tools and practices e.g., load shedding, release processes and configuration pushes
- extending and improving playbook entries to rely less on the developer team’s tribal knowledge
- aligning the service’s configurations with the SRE team’s common languages and infrastructure
Smoothing the pathSREs need to understand the developers’ service, but SREs and developers also need to understand each other. If the developer team has not worked with SREs before, it can be useful for SREs to give “lightning” talks to the developers on SRE topics such as monitoring, canarying, rollouts and data integrity. This gives the developers a better idea of why the SREs are asking particular questions and pushing particular concerns.
One of Google’s SREs found that it was useful to “pretend that I am a dev team novice, and have the developer take me through the codebase, explain the history, show me where the main() function is, and so on.”
Similarly, SREs should understand the developers’ point of view and experience. During the SER, at least one SRE should sit with the developers, attend their weekly meetings and stand-ups, informally shadow their on-call and help out with day-to-day work to get a “big picture” view of the service and how it runs. It also helps remove distance between the two teams. Our experience has been that this is so positive in improving the developer-SRE relationship that the practice tends to continue even after the SER has finished.
Last but not least, the SRE entrance review document should also state clearly whether the service merits SRE takeover, and if so, why (or why not).
At this point, the developer team and SRE team both understand what needs to be done to make a service suitable for SRE takeover, if it is indeed feasible at all. In Part 3 of this blog post, we’ll look at how to proceed with a service takeover, and so both teams can benefit from the process.
See part 3 of this series, "Making the most of an SRE service takeover - CRE life lessons."