A Google SRE explores GitHub reliability with BigQuery
Data can tell wonderful stories, particularly when visualized in interesting ways. Felipe Hoffa told a great one recently when he visualized 8 years of reddit’s activity and uptime using BigQuery. As a fledgling SRE on a Mission Control rotation, I found myself asking “If I were an SRE responsible for this service, how would I approach this problem?”
Well, Felipe is at it again, this time with some historical GitHub data, and he asked me to play along, bringing an SRE’s point of view. First, let’s figure out if we think it’s possible to infer the health of GitHub based on the rate of events we can see in the GitHub Archive data in BigQuery. There are a variety of different kinds of events, but what we care about for this analysis is that each event represents a successful request to GitHub.
Using this query:
#StandardSQL SELECT TIMESTAMP_TRUNC(created_at, MINUTE) minute, COUNT(*) FROM `githubarchive.month.201607` GROUP BY 1 ORDER BY 1
We can find the number of GitHub events that happened each minute in the month of July 2016. The created_at field contains a timestamp in microseconds, which the query truncates to one minute granularity. This allows us to use the COUNT aggregate function to tally the number of events in each minute when we GROUP BY the timestamp. A simple visualization of the query results produces this graph:
There are a few interesting points in the graph where the number of events is unusually low, but it’s hard to tell from this graph exactly what a “normal” vs “abnormal” minute looks like. However, generating a histogram from the data makes things a little clearer.
This graph strongly suggests that anything below about 200 events per minute is abnormal for GitHub, at least in July of 2016. Let’s assume that extremely few events per minute is not driven by abnormally few end user requests, but instead by a problem with the service itself. Two possible explanations are that the requests were not reaching the service, or the service was not able to respond to them successfully. This gives us a signal that approximates when GitHub was “healthy” vs. “unhealthy.”
What we’ve done here is define a “Service Level Indicator” or SLI. To quote the SRE book, an SLI is “a carefully defined quantitative measure of some aspect of the level of service that is provided.” In this case the SLI is “number of successful events per minute." What we’d like to do is transform this SLI into something that tells us if the service is “up” or “down." The simplest way to do this is to choose a threshold for “events per minute,” below which we will consider the service “down." But what threshold should we choose?
A total outage would result in no events being recorded for a particular minute. So we can detect total outages by setting our “down” threshold at 0. However, most outages are not total outages. The service gets into some kind of unhealthy state and begins serving an abnormally high rate of errors, which manifests in the data as an unusually low rate of events (successful requests).
Thinking about total outages where no events are recorded reveals a problem with the simple query we used above. Minutes where no events are recorded do not show up in the results! It completely ignores total outages. The query below shows how to address this issue. It generates a table that includes all the minutes of the month we care about, joins it with the results of the query above, then maps any NULL results (which mean “no rows”) to zero events:
#StandardSQL SELECT b.minute m, IFNULL(c, 0) c FROM ( # first day of month SELECT i*60 + UNIX_SECONDS(TIMESTAMP('2016-07-01')) minute FROM `fh-bigquery.geocode.numbers_65536` WHERE i < 60*24* # number of days in month EXTRACT(DAY FROM DATE_SUB(DATE_ADD(DATE('2016-07-01'), INTERVAL 1 MONTH), INTERVAL 1 DAY)) ) b LEFT JOIN ( SELECT UNIX_SECONDS(TIMESTAMP_TRUNC(created_at, MINUTE)) minute, COUNT(*) c FROM `githubarchive.month.201607` GROUP BY 1 ) a ON a.minute=b.minute ORDER BY b.minute
To make the rest of our queries simpler, we save this as a view named
events_per_minute_201607, and reference that view in the following queries.
Now that we have a view that includes minutes where there were zero events, we can use a quantile function to explore the shape of the data near the low end of the events per minute range.
#StandardSQL SELECT APPROX_QUANTILES(c, 10001) as q FROM `fh-bigquery.public_dump.events_per_minute_201607`
The “10001” parameter to APPROX_QUANTILE function means that the data has been divided into 10,000 buckets, each with the same number of data points. Non-statistics majors may be familiar with the specific case of a quantile with 100 buckets, which is usually referred to as a percentile. For example, if you’re in the 95th percentile on a test, that means that 5% of the test takers did better than you, 94% did worse and 1% did about the same.
Here are the first 20 results from the query:
And here are the same 20 results in bar graph form:
What we're looking for here are points that fall outside the “normal” range, since we presume that outages are abnormal. We want to choose a threshold that's low enough that it won’t misinterpret a slow minute as downtime, but high enough that it will detect significantly degraded operations. There seems to be a “state change” somewhere between 20 and 77 events per minute, which might indicate that below 20 events per minute the service is almost completely “down."
While the analysis above is far from conclusive, let’s choose a threshold of 20 events per minute to declare the service “down,” and see what we find. We can see how many minutes the service was down in July of 2016 with this query:
#StandardSQL SELECT m, c FROM `fh-bigquery.public_dump.events_per_minute_201607` WHERE c <= 20
The query returns 54 rows, thus the service, according to our definition, was “down” for 54 minutes in July. But is 54 minutes good? Or bad?
To answer that, we need to decide what an acceptable amount of downtime is. Uptime goals are often expressed in terms of how many “nines” you aim to provide, e.g., 99% uptime, 99.9% uptime, etc. A 99.9% uptime goal implies that you can only be down 0.1% of the time. But for such a statement to be meaningful, a time period must be selected. One month is a common choice. Thus a precise statement of the goal might be “the service will not be down for more than 0.1% of minutes in a calendar month."
Adding a goal like this to an SLI transforms it into a Service Level Objective, or SLO. There were 44,640 minutes in July 2016. This allows up to 44.6 minutes of downtime before the SLO is missed. This is often referred to as the “error budget.”
Thus we can say that in July of 2016, the service consumed 54 minutes of its 44 minute error budget, and thus, it would have missed a 99.9% SLO by a margin of 10 minutes.
A typical way to visualize an SLO is to accumulate error minutes over the chosen period of time, in this case, the 31 days of July. The following query produces the cumulative downtime over the month of July.
#StandardSQL SELECT m, SUM(down) OVER(ORDER BY m) cumulative_down FROM ( SELECT m, c, 1 as down FROM `fh-bigquery.public_dump.events_per_minute_201607` WHERE c <= 20 )
When plotted, this query produces the chart below. The steps in this graph correspond to “outages.” The longer the outage, the bigger the step. When the line goes above 44, we've exceeded the error budget, and the service is “out of SLA” for the 30 day window.
Our analysis indicates that most of the error budget was consumed by a pair of outages, one in the middle of the month, and one near the end. And indeed, if we look at GitHub’s status archive, we can confirm that GitHub reported an “Increased exception rate” on July 12th, and a “Major service outage” on July 27th.
While SLOs are sometimes expressed in terms of calendar months, such a definition doesn't always bring an engineer joy, for two reasons. First a “month” is not a fixed time span. It can have anywhere between 28 and 31 days, depending on which month it happens to be. Second, it makes certain days special, when the error budget gets magically reset. For example, a service with an SLO of 99.9% uptime per calendar month could suffer a 44-minute outage on July 31st, and another 44-minute outage on Aug 1st, and still be just barely “within SLO,” as long as there were no other outages in those months, despite the fact that the service suffered 88 minutes of downtime over a 48-hour period.
To address these inconsistencies, SLOs are often expressed in terms of a 30-day sliding window, instead of a calendar month. Ignoring leap seconds for a moment, 30 days is a fixed time span, unlike a calendar month. And using a sliding window also addresses the second problem discussed above. Two hypothetical 44-minute outages on July 31st and Aug 1st would put the service “out of SLO” from the time of the Aug 1st outage until July 31st outage drops out of the sliding window.
This is important, because being “out of SLO” should cause the teams responsible for the service to apply more conservative decision making. For example, if you are “out of SLO”, it’s probably not the best time to perform a major software upgrade. 88 minutes of downtime over a 48 hour period should make a team that has an SLO of 99.9% apply conservative decision making. The first SLO definition based on calendar months would not enforce that, whereas the second SLO definition based on a 30 day sliding window would.
BigQuery makes it easy to query across multiple date-sharded tables, so we can create similar views for June, July and August of 2016, and we use the OVER clause to calculate aggregates over a sliding window of 43,200 minutes (30 days):
When we plot this new sliding window data, we can clearly see the period of time where the service was out of SLO (i.e. where the graph goes above 43) from late July into early August. We also clearly see it return to being within SLO part way through August, at which point it becomes appropriate to resume “business as usual” with respect to the service.
#StandardSQL SELECT m, cumulative_down, next_cum FROM ( SELECT m, cumulative_down, LEAD(cumulative_down) OVER(ORDER BY m) next_cum, MIN(m) OVER() min_m, MAX(m) OVER() max_m FROM ( SELECT m, SUM(IF(c <= 20, 1, 0)) OVER(ORDER BY m ROWS BETWEEN 43200 PRECEDING AND CURRENT ROW) cumulative_down FROM ( SELECT m, c, 1 as down FROM `fh-bigquery.public_dump.events_per_minute_201606` UNION ALL SELECT m, c, 1 as down FROM `fh-bigquery.public_dump.events_per_minute_201607` UNION ALL SELECT m, c, 1 as down FROM `fh-bigquery.public_dump.events_per_minute_201608`) ) ) WHERE cumulative_down!=next_cum OR m IN (min_m, max_m) ORDER BY m
Throughout this conversation, we have not yet mentioned a Service Level Agreement (SLA). That is quite intentional. SLIs are about indicating the health of a service. SLOs are about providing guidance on when it's appropriate to perform riskier operations such as major service upgrades. SLAs are a legal agreement between a service provider and their customers, and generally specify monetary penalties, such as service credit, for SLA violations. An SLA should be related to one or more SLOs, since the SLOs guide how you operate your service, but the SLA is often different in the details and thresholds. For example, legal contracts are often written in terms of calendar months for simplicity, even though as we’ve observed here that a 30-day sliding window is a better tool for guiding how to operate the service. Also, the SLO threshold is usually more conservative than the SLA threshold. For example, the SLO threshold might be 99.9%, but the threshold at which penalties are invoked in an SLA might be 99.5%.
In the specific example of GitHub, their terms of service make no mention of an “SLA." They in fact state that “The service is provided on an "as is" and "as available" basis.” Does this lack of SLA mean that GitHub does not care about reliability? Of course not! The fact that they publish a status page shows that they not only measure and track their reliability, but also are transparent about it.
At this point, we should note that in order to have something interesting to show in our graphs, we pre-explored the data a little using ad-hoc queries to find a relatively recent month that showed an outage or two, and we confirmed the outages were genuine using GitHub’s status archive. If we had instead selected the month of August, this would’ve been a very boring blog post, since GitHub had 100% reliability in August. Nice work GitHub!