How to build reliable systems (with unreliable components): A conversation
We published another episode of “VM End to End,” which is a series of curated conversations between a “VM skeptic” and a “VM enthusiast”. Every episode, join Brian, Carter, and a special guest as they explore why VMs are some of Google’s most trusted and reliable offerings, and how VMs benefit companies operating at scale in the cloud. Here is a transcript of the episode:
Carter: Hi, and welcome back to VM End to End, a show where we have a VM enthusiast and a VM skeptic, or former VM skeptic, hash out all things VMs. If you remember the show from last season, we talked about reliability, automation, cost, and all of the benefits that you can get from cloud-native VMs, but some parts still confuse me. So, I brought Brian back in to talk about it. Brian, hi. I need to know why I can't have just one giant super machine that just works instead of all these flimsy little parts.
Brian: That would be amazing, wouldn't it? You know, just kind of an infinitely large machine, but that is not a thing. And this season, we're going to bring some guests in to dig deeper into some of these concepts. So, we brought in Steve McGhee to tell us more about why that isn't a thing and how to get reliability out of things. Hey, Steve.
Steve: Hey guys. How's it going?
Carter: Hi, Steve. Glad to have you here. Can you tell us a little bit about yourself and who you are?
Steve: Sure. I'm Steve McGhee. I was an SRE at Google for about ten years, and then I left and became a cloud customer. I learned how to cloud, how to use all the different clouds, and just how difficult it actually is to make decisions and stuff like that. And then, I came back to Google to help more customers do exactly that. I tend to focus on reliability because my background was in site reliability engineering.
Carter: See, this is great because you would think that having one piece of hardware you upgrade and put a lot of resources into will be a lot more reliable than having lots of smaller parts that have to communicate, which can all fail. So, where am I going wrong here?
Steve: Yeah. I mean, your intent is correct. Like it would be sweet if we just had a perfect, infinitely dense, infinitely fast computer. When you work on a system on your laptop…
You know the whole phrase "it works on my machine, like what's the problem?" When we go into production, we're scaling up, and we have to scale up because we don't want to have a service for one person. We, hopefully, want it for many people. Potentially, even around the world or something like that. So, being able to scale up into production means that we run into fundamental laws of physics. So, the speed of light comes into play and the density of materials. For example, it would be bad to go so dense that everything catches fire.
And so, that means we have to start spreading things out. It means we're taking what was one computer, and we're spreading it across time and space, in a way, by using many computers instead. I hope that helps understand what the fundamental problem is here.
Carter: It does because we talked about this a lot. Brian hammered home a point last season; he said a cloud VM is a slice of a data center, more so than a slice of one specific machine. And so, it seems like that's playing out here. But what I don't understand is that Google touts its reliability and being up for a long time. How can it do that with so many parts, like disks, that are constantly going to fail?
Steve: If you think about what it used to be like during the dot-com boom, we called these gold-plated servers. We would make the most expensive, solid server that you could ever have. And those were pretty awesome. But they were super expensive. And it turns out that putting all your eggs into the basket of making one machine extremely reliable gets you diminishing returns because you still have to have some downtime for maintenance and things like that.
And so, what we found inside of Google is that we wanted to horizontally scale, which is adding more machines at a time to achieve planet-wide scale, like having billions of users, for example. You can't fit billions of users on one super machine. So, once you start horizontally scaling, using the most expensive machine possible with a marginal return doesn't make sense.
And so, this is where we introduced resilience at many layers in the stack. And it turns out it's way better for the service; it's more economical and allows you to move faster, which sounds crazy, but it's true.
Brian: This shows up in many places in Google Cloud, specifically around VMs. The VMs themselves are insulated to some degree from the underlying systems through live migration. We've got different zones of separation between them. And we've got load balancers so that you can hit different pieces. But it feels like there's a core principle or two here. I don't know if you could talk a little bit more about that.
Steve: Yeah. I mean, I went to school for computer science, so I consider myself a computer scientist, I guess, or a software engineer. And I think everyone who's super into computer science knows the one holy thing is layers of abstraction. Right? Abstraction solves everything.
I jokingly refer to this as the matryoshka doll, which is like those little nesting dolls you've seen, where the tiny little doll at the very middle is your CPU. And then there's a machine, and that has a VM around it. And that's in a data center which is part of a zone, which is, then, part of a region.
And so, at every layer of this stack, if you're able to perform some resilience engineering to make sure that you can take advantage of that layer of the stack, this is where you get defense-in-depth. So, you're able to handle failures at any level of this stack because, you know, disks fail, CPUs can go out, solar flares can cause memory corruption. Someone can cut a fiber line to a data center, right? And the whole building could potentially go offline.
Floods and fires happen. I mean, they're rare, but there are lots of potential failure modes that you have to consider, and they happen at many of these levels. So, being aware of all those and having mitigations ready for them is super important.
Brian: It's funny, we're calling them virtual machines in the cloud but it's really virtual machines and virtual disks and virtual load balancers and...
Steve: That's right. I liked what you said in the previous episode that you're getting a slice of a data center because we're able to take all these parts and present them to a customer as a nice abstraction. The customer sees this computer with this disk which has this network. And in actuality, that one disk you have is three disks, but we're not showing you all three disks. It's just like magically giving this resilience and the redundancy too.
Carter: Yeah. I see how abstracting away lower layers and saying, "I don't care which zone is under this region as long as one of them is up," makes sense. That would be hard for one supercomputer, but it's very feasible for many smaller machines.
But then, you have to start managing these separate zones and VMs and all this extra complexity that comes with it, which is one of the downsides of abstraction, sometimes. How does Google handle that?
Steve: Good question. I work with a lot of companies that ask exactly this. They're like, "Great. You want us to spread all of our computers around the planet. Yeah, that doesn't sound like a pain in the butt at all."
You know, it sounds like, well, how in the heck do you expect me to deal with that? And again, the answer comes back to abstraction.
So, when you think about it like you have one VM, and the first step you're going to get is many VMs, right? So, we put these together, and we put them behind a load balancer. We call this a service. If you're familiar with Kubernetes, it's pretty much the same idea where you have a bunch of VMs, and you have an entry point that talks to all of them. If you can take this service and run it across multiple zones, we can call it a regional service because it now lives across many of these zones in one region.
And then, if you can get that regional service in many regions, I call this a distributed service. Like there's no canonical name for that right now. But the idea is that it's many of these services, but they all do the same thing and represent the same business need. And so, you're able to, now, handle regional-level events as well.
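The layering Steve describes (a service in one zone, a regional service, a distributed service) can be sketched as a small model. This is only an illustration: the `Instance` type, the `classify` and `survives` helpers, and the zone and region names are all hypothetical, and "single-zone service" is a label chosen here rather than a term from the conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One VM behind the service's entry point (hypothetical model)."""
    name: str
    zone: str
    region: str


def classify(instances):
    """Name the service by how widely its instances are spread."""
    regions = {i.region for i in instances}
    zones = {i.zone for i in instances}
    if len(regions) > 1:
        return "distributed service"
    if len(zones) > 1:
        return "regional service"
    return "single-zone service"


def survives(instances, failed_zones=frozenset(), failed_regions=frozenset()):
    """True if at least one instance sits outside every failed zone and region."""
    return any(
        i.zone not in failed_zones and i.region not in failed_regions
        for i in instances
    )


# Made-up zone and region names, for illustration only.
regional = [
    Instance("vm-1", "zone-a1", "region-a"),
    Instance("vm-2", "zone-a2", "region-a"),
]
distributed = regional + [Instance("vm-3", "zone-b1", "region-b")]

print(classify(regional), survives(regional, failed_zones={"zone-a1"}))
print(classify(distributed), survives(distributed, failed_regions={"region-a"}))
```

In this model the regional service rides out the loss of one zone but not the loss of its region, while the distributed service rides out both.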
Carter: Okay. Okay. I'm putting all this together. It's still hard for me to believe these components can go down so often yet somehow give you more reliability because you're layering on top of each other. Maybe, Brian, you're always good at giving me concrete examples. Where do we use this in Google Cloud?
Brian: The most common example is when you have multiple VMs doing the same job, maybe they're a web server or something like that, and you put a load balancer in front of them. This shows up in physical installations and in the cloud; it's maybe the most common example of this.
So you have one endpoint, and you have multiple different backends that can do the work. And if one of them goes away, the work still happens. So, I think that's like my favorite specific example, but Steve, is there a more general principle at work here?
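Brian's example of one endpoint with several interchangeable backends can be sketched in a few lines. This is a toy, in-process model under stated assumptions: `Backend` and `RoundRobinBalancer` are hypothetical names, a failed backend is simulated with a flag, and a real load balancer would use health checks and network calls rather than exceptions.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """A stand-in for one VM serving requests (hypothetical)."""
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class RoundRobinBalancer:
    """Single entry point that rotates through backends and skips failed ones."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._next = 0

    def handle(self, request: str) -> str:
        # Try each backend at most once, starting from the rotating position.
        for _ in range(len(self._backends)):
            backend = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            try:
                return backend.handle(request)
            except ConnectionError:
                continue  # this backend is down; move on to the next one
        raise RuntimeError("no healthy backends available")


pool = [Backend("web-1"), Backend("web-2"), Backend("web-3")]
balancer = RoundRobinBalancer(pool)
pool[1].healthy = False  # simulate one VM going away
responses = [balancer.handle(f"req-{i}") for i in range(4)]
print(responses)
```

With `web-2` marked as down, every request is still answered by `web-1` or `web-3`; the caller only ever talks to the balancer.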
Steve: Yeah, totally. So, as an analogy, think about boats in the water. Like physical boats. If you poke a hole in the bottom of your dinghy, you're going to have a bad day, right?
We call this one failure domain. Your little dinghy is a failure domain because there's only one floor to it. And if you put a hole anywhere in that floor, it's going to flood. It's not great. But, think of a bigger ship, like a container ship or some giant vessel of some kind; they're kind of like a bunch of little boats tied together because they have these things called bulkheads. So, you could poke a hole in the bottom of these big ships, and it would be fine because what it would do is it'd fill up that bulkhead, but the rest of the boat is so buoyant that it stays floating.
So essentially, like you're taking these things that could fail with a hole, and you stick them together. And now, the system, as a whole, doesn't fail even with that same failure mode.
Brian: Right. So, that's what's going on with our VMs and load balancer kind of thing. But how does this work? If you've got VMs and load balancers and zones and regions and other distributed services, how do we reason about this? Like how do we decide how much complexity we're going to put into this?
Steve: Good question. Yeah, that was kind of like the abstract academic answer. "Let's talk about like reality, Steve." It's important to understand what level of availability you can expect from the systems you're building on top of. Like how often do we get a hole in the bottom of a boat? Like can we get a number? That would be nice.
The way that we suggest people think about it is that for something that lives in one zone, we say it's going to be available 99.9% of the time. Specifically, we say it's designed to be available 99.9% of the time. It means it will be down or unavailable for 40 minutes a month. And it's essential to know that's something you can pretty much count on; it's not like the best-case or the worst-case scenario.
Similarly, if you have something across many zones in a region, we call that four nines. It's 99.99% available. And the way to think about that is it's 40 minutes a year of downtime. Computers stay up quite a lot, and forty minutes in an entire year is a tiny fraction of that time. But sometimes, that's still not acceptable, and it comes down to what you're putting on these computers. Like are you okay with those three nines or those four nines of availability? You have to decide what's appropriate for you.
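The downtime figures quoted here are round numbers. As a rough check, the arithmetic can be sketched as below, assuming a 365-day year and an average month of one twelfth of a year (both are assumptions of this sketch, not definitions from the conversation). Under those assumptions, 99.9% works out to about 44 minutes a month and 99.99% to about 53 minutes a year.

```python
# Downtime implied by an availability target (illustrative arithmetic only).
MINUTES_PER_YEAR = 365 * 24 * 60            # 525,600 minutes
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12   # 43,800 minutes in an average month


def downtime_minutes(availability: float, period_minutes: float) -> float:
    """Expected minutes of unavailability in a period, given availability as a fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_minutes


three_nines_month = downtime_minutes(0.999, MINUTES_PER_MONTH)
four_nines_year = downtime_minutes(0.9999, MINUTES_PER_YEAR)
print(f"99.9% over a month: {three_nines_month:.1f} minutes")
print(f"99.99% over a year: {four_nines_year:.1f} minutes")
```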
Carter: I get this now. Because basically, you're saying, "Okay, I have two computers, and they're only down 40 minutes every year. If we layer them over each other when one's down, the other one's probably not down, so we're good."
It's interesting to see the added complexity that comes in that you have to manage independently. I wonder if I have to come in as the developer and be like, "Okay, I have to schedule this computer to be up watching for the other one to be down," or is this something that the infrastructure can start to take care of for you?
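Carter's intuition about layering two machines can be checked with a short calculation. The sketch below assumes the replicas fail independently and that the service is up whenever at least one replica is up; both are simplifying assumptions, and `combined_availability` is a hypothetical helper. Real failures are often correlated, so the result is best read as an upper bound rather than a number to plan around.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of a group that is up when at least one replica is up.

    Assumes replica failures are independent, which real systems only approximate.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    # The group is down only when every replica is down at the same time.
    return 1.0 - (1.0 - single) ** replicas


one = combined_availability(0.999, 1)
two = combined_availability(0.999, 2)
print(f"one replica:  {one:.6f}")
print(f"two replicas: {two:.6f}")
```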
Steve: Yeah. The good news is: this is a problem we've been dealing with inside Google for the past 20 years. And we've come up with many solutions and improvements to the system that we're bringing to cloud customers. For example, we know that you sometimes have to power down a rack of machines. And if your server is on that rack, we built this thing called live migration, which takes like the soul of your machine -- the running system, the running software -- and transports it magically to another machine. And that way, we can safely power down this rack and bring it back up again. It's one less thing for you to worry about. And when I said before that machines are designed to have 99.9% of availability, it accounts for stuff like this. This is how we're giving you 99.9% availability.
There's a bunch of other stuff too. Sometimes, maybe the machine has a problem, or maybe the building it's in needs maintenance. Who knows? All kinds of things could go wrong, and we aggregate all that risk into one number. And then, we give you that expectation, and you don't have to worry about all those weird things that could go wrong. You just know this machine will be up 99.9% of the time.
Brian: So, by running a VM in a cloud instead of on an individual physical machine, a whole bunch of these kinds of edge cases are handled for you.
You get many tools following this same pattern. For example, something similar is happening at the disk level. We've got load balancers built out of bunches of machines. We've got groups of VMs like managed instance groups, and then, like zones and regions and all this kind of stuff.
But then, at some point, some reliability must be left to the app, right?
Carter: So you're saying you want to design an application that can handle this forty minutes of downtime a year as well? That's interesting.
Brian: Yeah. Okay. So, we get to the point where we can trust the VMs, and then, kind of build from there, I guess.
Steve: Yeah. The important point is to know what level of trust you can put into the VM, and what you're doing is you're putting that trust into the hands of your cloud provider, and you're saying, "Look, just tell me what it's going to be."
On GCP, for our VMs, we tell you it will be 99.9 for one VM in one zone. And then, it just means that you get to put that all out of your mind, and you can trust that number is going to be accurate, and now, you can work around it.
You mustn't think that that number is one hundred percent, even though it looks close enough to a hundred because that's not true. If you make a distinction between 99.9 and a hundred, you're going to make radically different design choices in your application. You're going to start introducing things like retries or checkpointing, or you're going to allow your service to live in two places simultaneously.
Like what if one request comes in here and one comes in there, and this one's up and this one's not? Now that you've allowed a little bit of failure into your understanding of the model, you change the way you think about designing your system. And to me, one of the fundamental things that made Google succeed from the very beginning was that we designed our systems to allow some form of failure. And then, we just added that resiliency over and over and over again throughout the stack.
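One of the design choices Steve lists, retries, can be sketched as follows. This is a minimal illustration, not a description of how any Google system does it: `call_with_retries` and `flaky` are hypothetical names, the backoff parameters are arbitrary, and retrying like this is only safe when repeating the operation does no harm.

```python
import random
import time


def call_with_retries(operation, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on ConnectionError, back off and try again.

    Gives up and re-raises after `attempts` failed tries.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the failure to the caller
            # Exponential backoff with jitter so many clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            sleep(delay)


calls = {"count": 0}


def flaky():
    """Simulated dependency that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


# A no-op sleep keeps the demonstration instant; real code would use time.sleep.
result = call_with_retries(flaky, sleep=lambda _seconds: None)
print(result, "after", calls["count"], "calls")
```

The caller does not need to know why the first two calls failed, only that a failure is possible and that trying again is acceptable.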
Carter: The way you just worded that finally made it click for me. By planning for failure, I'll have the case covered no matter what the failure is. And there might be some delay in getting this information to me somehow, so I'll retry sending it. And that allows you to build more resilience and reliability into your system without knowing the exact failure.
There are important cases in the real world where this matters, which is why there's so much effort going into this thought.
Steve: That's right. It's important not to try to get the highest level of reliability out of absolutely everything. Some websites or services or whatever can be down, and it's no big deal. But other services are essential and need higher reliability.
So, let's say there's a service that responds when someone needs to get a new delivery of oxygen tanks to some medical center. We should probably make sure that that one works most of the time. And even if you make a request and it doesn't work right away, we should know that once we put it in, it was received, and someone will take action on it or something like that.
This is just a silly example, but more and more of these services are becoming more and more important to the world as we're just putting more of our society online. I think COVID is actually an interesting time to go through this whole reliability effort because we saw a lot of services become more and more important to individuals in an accelerated way.
So think about video conferencing for school kids or just ordering things from home. Doing this online is becoming more important. So, making sure that you can do it when you want to do it is more of a big deal even than it used to be.
Brian: Yeah. Things are getting very real, and as more of our life goes online, it becomes more critical. Okay, this has been amazing. I think we're going to have to wrap up this one, but I feel like there's much more to talk about here. So, Steve, would you be up for coming back and talking specifically about what questions you ask when you decide whether you're going to try to get another nine, and then, how you actually do it?
Steve: Yeah. I think I can make it, it's fine. I think this is an important thing, like people really want to hear about this stuff. So, I'd love to come back.
Carter: Well, thank you so much, Steve. Brian, thank you so much. If you're listening at home, please, write in the comments, and let us know if there's anything you've learned about reliability. I know I learned a few things; maybe you did too. Thank you.
Special thanks to Steve McGhee, Reliability Advocate at Google, for being this episode's guest!