An update on Sunday’s service disruption
Benjamin Treynor Sloss
Yesterday, a disruption in Google’s network in parts of the United States caused slow performance and elevated error rates on several Google services, including Google Cloud Platform, YouTube, Gmail, Google Drive and others. Because the disruption reduced regional network capacity, the worldwide user impact varied widely. For most Google users there was little or no visible change to their services—search queries might have been a fraction of a second slower than usual for a few minutes but soon returned to normal, their Gmail continued to operate without a hiccup, and so on. However, for users who rely on services homed in the affected regions, the impact was substantial, particularly for services like YouTube or Google Cloud Storage which use large amounts of network bandwidth to operate.
For everyone who was affected by yesterday’s incident, I apologize. It’s our mission to make Google’s services available to everyone around the world, and when we fall short of that goal—as we did yesterday—we take it very seriously. The rest of this document explains briefly what happened, and what we’re going to do about it.
Incident, Detection and Response
In essence, the root cause of Sunday’s disruption was a configuration change that was intended for a small number of servers in a single region. The configuration was incorrectly applied to a larger number of servers across several neighboring regions, and it caused those regions to stop using more than half of their available network capacity. The network traffic to/from those regions then tried to fit into the remaining network capacity, but it did not. The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows, much as urgent packages may be couriered by bicycle through even the worst traffic jam.
Google’s engineering teams detected the issue within seconds, but diagnosis and correction took far longer than our target of a few minutes. Once alerted, engineering teams quickly identified the cause of the network congestion, but the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage. The Google teams were keenly aware that every minute which passed represented another minute of user impact, and brought on additional help to parallelize restoration efforts.
Overall, YouTube measured a 2.5% drop of views for one hour, while Google Cloud Storage measured a 30% reduction in traffic. Approximately 1% of active Gmail users had problems with their account; while that is a small fraction of users, it still represents millions of users who couldn’t receive or send email. As Gmail users ourselves, we know how disruptive losing an essential tool can be! Finally, low-bandwidth services like Google Search recorded only a short-lived increase in latency as they switched to serving from unaffected regions, then returned to normal.
With all services restored to normal operation, Google’s engineering teams are now conducting a thorough post-mortem to ensure we understand all the contributing factors to both the network capacity loss and the slow restoration. We will then have a focused engineering sprint to ensure we have not only fixed the direct cause of the problem, but also guarded against the entire class of issues illustrated by this event.
We know that people around the world rely on Google’s services, and over the years have come to expect Google to always work. We take that expectation very seriously—it is our mission, and our inspiration. When we fall short, as happened Sunday, it motivates us to learn as much as we can, and to make Google’s services even better, even faster, and even more reliable.