Google Cloud Platform
Making the most of an SRE service takeover—CRE life lessons
In Part 2 of this blog post we explained what an SRE team would want to learn about a service angling for SRE support, and what kind of improvements they want to see in the service before considering it for take-over. And in Part 1, we looked at why an SRE team would or wouldn’t choose to onboard a new application. Now, let’s look at what happens once the SREs agree to take on the pager.
Onboarding preparationIf a service entrance review determines that the service is suitable for SRE support, developers and the SRE team move into the “onboarding” phase, where they prepare for SREs to support the service.
While developers address the action items, the SRE team starts to familiarize itself with the service, building up service knowledge and familiarity with the existing monitoring tools, alerts and crisis procedures. This can be accomplished through several methods:
- Education: present the new service to the rest of the team through tech talks, discussion sessions and "wheel of misfortune" scenarios.
- “Take the pager for a spin”: share pager alerts with the developers for a week, and assess each page on the axes of criticality (does this indicate a user-impacting problem with the service?) and actionability (is there a clear path for the on-call to to resolve the underlying issue?). This gives the SRE team a quantitative measure of how much operational load the service is likely to impose.
- On-call shadow: page the primary on-call developer and SRE at the same time. At this stage, responsibility for dealing with emergencies rests on the developer, but the developer and the SRE collaborate on debugging and resolving production issues together.
Q: I’ve gone through a lot of effort to make my service ready to hand over to SRE. How can I tell whether it was a good expenditure of scarce engineering time?
If the developer and SRE teams have agreed to hand over a system, they should also agree on criteria (including a timeframe) to measure whether the handover was successful. Such criteria may include (with appropriate numbers):
- Absolute decrease of paging/outages count
- Decreasing paging/outages as a proportion of (increasing) service scale and complexity.
- Reduced time/toil from the point of new code passing tests to being deployed globally, and a flat (or decreasing) rollback rate.
- Increased utilization of reserved resources (CPU, memory, disk etc.)
Taking over the pager
Once all the blocking action items have been resolved, it’s time for SREs to take over the service pager. This should be a "no drama" event, with few, well-documented service alerts, that can be easily resolved by following procedures in the service playbook.
In theory, the SRE team will have identified most of these issues in the entrance review phase, but realistically there any many issues that are only apparent with sustained exposure to a service.
In the medium term (one to two months), SREs should build a list of deficiencies or areas for optimization in the system with regard to monitoring, resource consumption etc. This hitlist should primarily aim to reduce SRE “toil” (manual, repetitive, tactical work that has no enduring value), and secondarily fix aspects of the system, e.g., resource consumption or cruft accumulation, which can impact system performance. Tertiary changes may include things like updating the documentation to facilitate onboarding new SREs for system support.
In the long term (three to six months), SREs should expect to meet most or all of the pre-established measurements for takeover success as described above.
Q: That’s great, so now my developers can turn off their pager?
Not so fast, my friend. Although the SRE team has learned a lot about the service in the preceding months, they're still not experts; there will inevitably be failure modes involving arcane service behavior where the SRE on-call will not know what has broken, or how to fix it. There's no substitute for having a developer available, and we normally require developers to keep their on-call rotation so that the SRE on-call can page them if needed. We expect this to be a low rate of pages.
The nuclear option — handing back the pager
Not all SRE takeovers go smoothly, and even if the SREs have taken over the pager for a service, it’s possible for reliability to regress or operational load to increase. This might be for good reasons such as a “success disaster” — a sustained and unexpected spike in usage — or for bad reasons such as poor QA testing.
An SRE team can only handle so many services, and if one service starts to consume a disproportionate amount of SRE time, it's at risk of crowding out other services. In this case, the SRE team should proactively tell the developer team that they have a problem, and should do so in a neutral way that’s data-heavy:
In the past month we’ve seen 100 pages/week for service S1, compared to a steady rate of 20-30 pages/week over the past few weeks. Even though S1 is within SLO, the pages are dominating our operational work and crowding out service improvement work. You need to do one of the following:
- bring S1’s paging rate down to the original rate by reducing S1’s rate of change
- de-tune S1’s alerts so that most of them no longer page
- tell us to drop SRE support for services S2, S3 so our overall paging rate remains steady
- tell us to drop SRE support for S1
There are also times when developers and SREs agree that handing back the pager to developers is the right thing to do, even if the operational load is normal. For example, imagine SREs are supporting a service, and developers come up with a new, shiny, higher-performing version. Developers support the new version initially, while working out its kinks, and migrate more and more users to it. Eventually the new version is the most heavily used — this is when SREs should take on the pager for the new service and hand the old service’s pager back to developers. Developers can then finish user migrations and turn down the old service at their convenience.
Converging your SRE and dev teamsOnboarding a service is about more than transferring responsibility from developers to SREs — it also improves mutual understanding between the two teams. The dev team gets to know what the SRE team does, and why, who the individual SREs are, and perhaps how they got that way. Similarly the SRE team gains a better understanding of the development team’s work and concerns. This increase in empathy is a Good Thing in itself, but is also an opportunity to improve future applications.
Now, when a developer team designs a new application or service, they should take the opportunity to invite the SRE team to the discussion. SRE teams can easily spot reliability issues in the design, and advise developers on ways to make the service easier to operate, set up good monitoring and configure sensible rollout policies from the start.
Similarly, when the SREs do future planning or design new tooling, they should include developers in the discussions; developers can advise them on future launches and projects, and give feedback on making the tools easier to operate or a better fit for developers’ needs.
Imagine that there was a brick wall between the SRE and developer teams; our original plan for service takeover was to throw the service over the wall and hope. Over the course of these blog posts, we’ve shown you how to make a hole in the wall so there can be two-way communication as the service is passed through, then expand it into a doorway so that SREs can come into the developers’ backyard and vice versa. Eventually, developers and SREs should tear down the wall entirely, and replace it with a low hedge and ornamental garden arch. SREs and developers should be able to see what’s going on in each others’ yard, and wander over to the other side as needed.