Identifying and tracking toil using SRE principles
One of the key measures that Google site reliability engineers (SREs) use to verify our effectiveness is how we spend our time day-to-day. We want ample time available for long-term engineering project work, but we’re also responsible for the continued operation of Google’s services, which sometimes requires doing some manual work. We aim for less than half of our time to be spent on what we call “toil.” So what is toil, and how do we stop it from interfering with our engineering velocity? We’ll look at these questions in this post.
First, let’s define toil, from chapter 5 of the Site Reliability Engineering book:
“Toil is the kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
Some examples of toil may include:
Handling quota requests
Applying database schema changes
Reviewing non-critical monitoring alerts
Copying and pasting commands from a playbook
A common thread in all of these examples is that they do not require an engineer’s human judgment. The work is easy but it’s not very rewarding, and it interrupts us from making progress on engineering work to scale services and launch features.
Here’s how to take your team through the process of identifying, measuring, and eliminating toil.
The hardest part of tackling toil is identifying it. If you aren’t explicitly tracking it, there’s probably a lot of work happening on your team that you aren’t aware of. Toil often comes as a request texted to you or email sent to an individual who dutifully completes the work without anyone else noticing. We heard a great example of this from CRE Jamie Wilkinson in Sydney, Australia, who shared this story of his experience as an SRE on a team managing one of Google’s datastore services.
Jamie’s SRE team was split between Sydney and Mountain View, CA, and there was a big disconnect between the achievements of the two sites. Sydney was frustrated that the project work they relied upon—and the Mountain View team committed to—never got done. One of the engineers from Sydney visited the team in Mountain View, and discovered they were being interrupted frequently throughout the day, handling walk-ups and IMs from the Mountain View-based developers.
Despite regular meetings to discuss on-call incidents and project work, and complaints that the Mountain View side felt overworked, the Sydney team couldn’t help because they didn’t know the extent of these requests. So the team decided to require all the requests to be submitted as bugs. The Mountain View team had been trained to leap in and help with every customer's emergency, so it took three months just to make the cultural change. Once that happened, they could establish a rotation of people across both sites to distribute load, see stats on how much work there was and how long it took, and identify repetitive issues that needed fixing.
“The one takeaway from this was that when you start measuring the right thing, you can show people what is happening, and then they agree with you,” Jamie said. “Showing everyone on the team the incoming vs. outgoing ticket rates was a watershed moment.”
When tracking your work this way, it helps to gather some lightweight metadata in a tracking system of your choice, such as:
What type of work was it (quota changes, push release to production, ACL update, etc.)?
What was the degree of difficulty: Easy (<1 hour); Medium (hours); Hard (days) (based on human hands-on time, not elapsed time)?
Who did the work?
This initial data lets you measure the impact of your toil. Remember, however, that the emphasis is on lightweight in this step. Extreme precision has little value here; it actually places more burden on your team if they need to capture many details, and makes them feel micromanaged.
Another way to successfully identify toil is to survey your team. Another Google CRE, Vivek Rau, would regularly survey Google’s entire SRE organization. Because the size and shape of toil varied between different SRE teams, at a company-wide level ticket metrics were harder to analyze. He surveyed SREs every three months to identify common issues across Google that were eating away at our time for project work. Try this sample toil survey to start:
Averaging over the past four weeks, approximately what fraction of your time did you spend on toil?
How happy are you with the quantity of time you spend on toil?
Not happy / OK / No problem at all
What are your top three sources of toil?
On-call Response / Interrupts / Pushes / Capacity / Other / etc.
Do you have a long-term engineering project in your quarterly objectives?
Yes / No
If so, averaging over the past four weeks, approximately what fraction of your time did you spend on your engineering project? (estimate)
In your team, is there toil you can automate away but you don’t do so, because that very toil takes time away from long-term engineering work? If so, please describe below.
Once you’ve identified the work being done, how do you determine if it’s too much? It’s pretty simple: Regularly (we find monthly or quarterly to be a good interval), compute an estimate of how much time is being spent on various types of work. Look for patterns or trends in your tickets, surveys, and on-call incident response, and prioritize based on the aggregate human time spent. Within Google SRE, we aim to keep toil below 50% of each SRE’s time, to preserve the other 50% for engineering project work. If the estimates show that we have exceeded the 50% toil threshold, we plan work explicitly with the goal of reducing that number and getting the work balance back into a healthy state.
Now that you’ve identified and measured your toil, it’s time to minimize it. As we’ve hinted at already, the solution here is typically to automate the work. This is not always straightforward, however, and the aim shouldn’t be to eliminate all toil.
Automating tasks that you rarely do (for example, deploying your service at a new location) can be tricky, because the procedure you used or assumptions you made while automating may change by the time you do that same task again. If a large amount of your time is spent on this kind of toil, consider how you might change the underlying architecture to smooth this variability. Do you use an infrastructure as code (IaC) solution for managing your systems? Can the procedure be executed multiple times without negative side effects? Is there a test to verify the procedure?
Treat your automation like any other production system. If you have an SLO practice, use some of your error budget to automate away toil. Complete postmortems when your automation fails, and fix it as you would any user-facing system. You want your automation available to you in any situation, including production incidents, to free humans to do the work they’re good at.
If you’ve gotten your users familiar with opening tickets to request help, use your ticketing system as the API for automation, making the work fully self-service.
Also, because toil isn’t just technical, but also cultural, make sure the only people doing toil work are the people explicitly assigned to it. This might be your oncaller, or a rotation of engineers scheduled to deal with “tickets” or “interrupts.” This preserves the rest of the team’s time to work on projects and reinforces a culture of surfacing and accounting for toil.
A note on complexity vs. toil
Sometimes we see engineers and leadership mistaking technical or organizational complexity as toil. The effects on humans are similar, but the work fails to meet the definition at the start of this post. Where toil is work that is basically of no enduring value, complexity often makes valuable work feel onerous.
Google SRE Laura Beegle has been investigating this within Google, and suggests a different approach to addressing complexity: While there’s intense satisfaction in designing a simple, robust system, it inevitably becomes somewhat more complex, simply by existing in a distributed environment, used by a diverse range of users, or growing to serve more functionality over time. We want our systems to evolve over time, while also reducing what we call “experienced complexity”—the negative feelings based on mismatched expectations about how long or difficult a task is to complete. Quantifying the subjective experience of your systems is known by another name: user experience. The users in this case are SREs. The observable outcome of well-managed system complexity is a better user experience.
Addressing the user experience of supporting your systems is engineering work of enduring value, and therefore not the same as toil. If you find that complexity is threatening your system’s reliability, take action. By following a blameless postmortem process, or surveying your team, you can identify situations where complexity resulted in unexpected results or a longer-than-expected recovery time.
Some manual care and feeding of the systems we build is inevitably required, but the number of humans needed shouldn’t grow linearly with the number of VMs, users, or requests. As engineers, we know the power of using computers to complete routine tasks, but we often find ourselves doing that work by hand anyway. By identifying, measuring, and reducing toil, we can reduce operating costs and ensure time to focus on the difficult and interesting projects instead.