Is there a limit to Cloud VMs? A conversation

We published another episode of “VM End to End,” which is a series of curated conversations between a “VM skeptic” and a “VM enthusiast”. Every episode, join Brian, Carter, and a special guest as they explore why VMs are some of Google’s most trusted and reliable offerings, and how VMs benefit companies operating at scale in the cloud. Here is a transcript of the episode:


Carter Morgan: Welcome to another episode of VM End to End, a show where you have someone a little skeptical of VMs and someone a little more enthusiastic about them on the show to hash out all things VM. Brian, thanks so much for being here today; I've got a tricky one for you.

Brian Dorsey: Okay, let's go.

Carter: Yeah, yeah. I want to know about cutting-edge technology, really pushing Cloud Compute machines to the limit. I think this one will trip us up.

Brian: I have the perfect person for us to bring in and talk to about this. I want to introduce Emma. Welcome, Emma.

Emma Haruka Iwao: Hi. Thank you for having me today.

Carter: Yeah, so happy to be able to talk with you. If I'm not mistaken, you made an amazing world record-breaking pi calculator using Google Cloud and Compute, yes?

Emma: Yes. It was in 2019. So, it's a little bit old, but it was a world record.

Carter: Yes. Can we get an overview of that project a little bit? What was it?

Emma: Sure. So, there are competitions around pi, the number beginning with 3.14. People try to calculate as many digits as possible. We recorded three, no, 31.4 trillion digits on Google Cloud, using 25 machines. It took 121 days and a hundred terabytes of storage. We did that.

Brian: This scale is why I wanted to invite Emma in because there are not many people who have run machines full-on for four months straight. And I think, more interestingly, worked through where the bottlenecks are in a system when you try to do that. Where were the bottlenecks in this process?

Emma: Sure. The bottleneck was the storage. To calculate pi, you need a lot of storage, 107 terabytes. But, the amount you read and write is enormous. We wrote about nine petabytes, sorry, eight petabytes, and read nine petabytes. A petabyte, that's a thousand terabytes. So, it's approximately 17,000 terabytes of data you need to process. The storage IO was the biggest bottleneck.
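
As a rough check on the figures Emma quotes here, the short sketch below adds up the reads and writes and spreads them over the 121-day run. It assumes decimal units (1 PB = 1,000 TB) and a constant rate across the whole run, which a real calculation would not have; the variable names are only illustrative.

```python
# Figures quoted in the conversation; decimal units are an assumption.
TB_PER_PB = 1_000
BYTES_PER_TB = 10**12
SECONDS_PER_DAY = 86_400

written_pb = 8   # petabytes written
read_pb = 9      # petabytes read
run_days = 121   # length of the calculation

total_tb = (written_pb + read_pb) * TB_PER_PB
total_bytes = total_tb * BYTES_PER_TB
run_seconds = run_days * SECONDS_PER_DAY

# Average over the entire run; actual throughput would rise and fall.
avg_gb_per_second = total_bytes / run_seconds / 10**9

print(f"Total I/O: {total_tb:,} TB")
print(f"Average throughput: {avg_gb_per_second:.2f} GB/s")
```

Sustaining well over a gigabyte per second of disk traffic for four months is consistent with storage I/O, rather than CPU, being the limiting factor.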

Carter: This is mind-breaking to me because I would not expect Cloud Compute to be able to handle that. That was my assumption coming into this. Was this one of the first projects of its type to do this, or is this commonplace in the Cloud?

Emma: I think it's getting more and more common these days. Many people just run high-performance computing, these types of massive workloads, on Cloud.

Carter: Then, my question would be, what are some of the advantages of doing this in the Cloud? Because I think I understand what this looks like on-prem. In that case, I'd buy a lot of my machines upfront and try to work through it from there. What does it look like in the Cloud?

Emma: I think the coolest part of using Cloud is that you don't need to choose which machine or architecture to use upfront. For example, suppose you are buying a physical machine. In that case, you need to decide how many cores you're getting, how much storage you are assigning to that machine, et cetera, et cetera. But, with Cloud, you can always try and test different parameters. For example, do I need a hundred gigabytes of memory or 200 gigabytes? Or, do I need 64 cores, 128 cores? You can try and test these parameters in the real environment.

Brian: You basically do small versions of it and see how much work you get for a certain amount of time and money and see how that turns out?

Carter: Wow. This is cool. Something I'm curious about is, okay, you said this was in 2019, and it's not the current record now or whatever. But, what would be different if you were coding the same thing? Say you were trying to do 32 trillion digits of pi again. What would change if you took that same code base now and just, I don't know, could you update it for some of the Compute technology now, or what?

Emma: Yeah. With Cloud, you can just change the parameters and use the latest CPUs and machines available today. For example, we have the latest Intel and AMD processors. We have more memory per machine. For example, the biggest machine we have has 11 terabytes of memory. We have newer persistent disk types that achieve lower cost and higher throughput. We have increased the network bandwidth from 16 gigabits per second to a hundred gigabits per second. So, by using all of these, I think we can probably finish the calculation in a quarter of the time we did in 2019.

Brian: There's a kind of a saying, time is money. At some level, that's just even more true in the cloud, right? If you make something faster, it costs less to run, and that's a big deal.

Carter: Can we talk about that a little bit? What was, say, the cost of this in 2019? Then, you said it's four times faster? What would the cost of it look like now? Just estimates.

Emma: Sure. If you don't work for Google, you would probably pay a lot for this project. It was about $300,000 to calculate pi if you paid as an external customer. Today, you can get the storage for half the price and achieve the same bandwidth. So, that cuts the cost by 30%, because the biggest cost was the storage.

Emma: We have a faster processor; we can use the Intel Ice Lake Xeon processor and increase the network bandwidth. In total, it took about four months to finish our calculation. Today, I think we can do it in probably 40-50 days. You pay less per hour, and you finish the calculation faster. So, by combining all of these factors, I think it's safe to say we pay much, much less today.
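
The numbers in this exchange can be put side by side with a little arithmetic. The sketch below is only a back-of-the-envelope reading of what Emma says: it assumes the 30% saving comes entirely from storage costing half as much, and it treats the 40-50 day figure as the new runtime. It does not model the lower hourly price she mentions, so the real saving would be larger.

```python
# Figures quoted in the conversation.
baseline_cost_usd = 300_000
baseline_days = 121
total_cost_reduction = 0.30       # "cut the cost by 30%"
estimated_days_today = (40, 50)   # "probably 40-50 days"

# Assumption: the whole 30% saving comes from storage at half the price.
storage_price_reduction = 0.50

# If halving storage removes 30% of the bill, storage was about 60% of it.
implied_storage_share = total_cost_reduction / storage_price_reduction

cost_after_storage_saving = baseline_cost_usd * (1 - total_cost_reduction)

# Speedup for the fast and slow ends of the runtime estimate.
speedups = [baseline_days / days for days in estimated_days_today]

print(f"Implied storage share of the bill: {implied_storage_share:.0%}")
print(f"Cost after the storage saving: ${cost_after_storage_saving:,.0f}")
print(f"Speedup range: {min(speedups):.1f}x to {max(speedups):.1f}x")
```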

Brian: That's awesome. The same project over time gets cheaper and cheaper. That seems like a win, but it seems like there are a lot of variations for how things could be done. You've talked about running experiments. If people wanted to run their own experiments on their own software, how would you recommend approaching that?

Emma: Sure. There are tons of configurations and variables in the cloud and probably in the software you want to run in the cloud. Of course, you can automate some of the aspects of your software. There is a way to automate that for the cloud as well. It's called infrastructure as code, and we use tools like Terraform and Pulumi to write down all the configuration parameters as a script and run it.

Emma: The coolest part is you can automate the deployment and provisioning process. For example, if you want to test with different CPU cores, like 2, 4, 8, 16, 32, 64, you just need to write a for loop and run the script again and again. Invoke the software and run the test for all these combinations. That's what you can do with the cloud.
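
The for loop Emma describes might look something like the sketch below. It is not the setup used for the pi calculation, and it uses plain `gcloud compute instances create` commands rather than Terraform to keep the example short. The zone, the `n2-standard` machine family, and the `bench-…` instance names are assumptions chosen for illustration. The script only builds and prints the commands; it does not create anything.

```python
import shlex

# Hypothetical values for illustration; none of these come from the episode.
ZONE = "us-central1-a"
MACHINE_FAMILY = "n2-standard"
CORE_COUNTS = [2, 4, 8, 16, 32, 64]


def build_create_command(cores: int) -> list[str]:
    """Build a gcloud command that would create one test VM with `cores` vCPUs."""
    name = f"bench-{cores}cpu"  # hypothetical naming scheme
    return [
        "gcloud", "compute", "instances", "create", name,
        f"--machine-type={MACHINE_FAMILY}-{cores}",
        f"--zone={ZONE}",
    ]


def plan_sweep(core_counts: list[int]) -> list[str]:
    """Return one shell-ready command line per core count (dry run only)."""
    return [shlex.join(build_create_command(c)) for c in core_counts]


if __name__ == "__main__":
    for line in plan_sweep(CORE_COUNTS):
        print(line)
```

In a real experiment, each iteration would also copy the software to the machine, run the benchmark, record the result, and delete the instance so that it stops accruing cost.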

Carter: Wow. Yeah, I can see that. Again, I'm still a little skeptical. But I see the benefit of being able to just say, "Let me try out all these different configurations for my workload." I'm curious, what are some other things that you think are pretty cool about this type of hardware that might not be obvious to someone like me, who doesn't do this kind of stuff as deeply as you do?

Emma: Right, it's good. You may think, "Hey, I can get the same hardware from a hardware store or hardware vendor." But it's actually not. Well, it's complicated because you buy hardware. You buy a new machine and use it for three or four years, right? You don't buy a new server every month. With the cloud, you just need to shut down the machine, change the configuration, and reboot. Then, you get access to the latest hardware without paying any upfront cost.

Emma: So if you want to, say, use the latest Intel processor or SSD or NVMe or [inaudible 00:08:59], then you just need to change the configuration. Then you keep all the data and all the configurations and just get the latest hardware for free without any upfront cost. I think that's pretty cool. You can use the latest hardware anytime on the cloud. Actually, the cloud is one of the earliest places to support the latest generations of hardware.

Carter: Oh man, that killed my argument because I was just going to be like, "Well, why don't I just buy the newest technology?" But, a lot of times, it doesn't roll out as fast as it would to the cloud.

Brian: You mentioned the networking got a lot faster?

Emma: Yeah. The network is faster. You have the ethernet and the switch and top-of-rack switch and the whole backend. The cool part of cloud is you don't need to think about the actual fabric. We support a hundred gigabits per second for a network, and each machine can communicate with another machine at a hundred gigabits per second, regardless of their physical positions.

Emma: When designing a physical data center, you need to think of top-of-rack bandwidth and how you distribute workload to get the full bandwidth for each machine. But, with cloud, we design the network so that each machine can achieve the full bisection bandwidth.

Brian: Oh, okay. So, that's a whole thing from an architecture and planning standpoint; you just don't even have to worry about it.

Carter: This is so interesting to me. I'm curious if you were going to do a project like this again and try to go for another record, would you still use cloud computers? Or, do you think you might switch over and use something on-prem, something more dedicated?

Emma: Because I work for Cloud, I would use Cloud. There are obviously pros and cons. For example, if you want to test a specific piece of hardware, you might still need to get access to that specialty equipment. But for, I think I have to say, 99% of the workloads in the world, you can just launch a machine in the cloud because you don't know which machine to buy before actually getting the machine.

Emma: So, for on-premises hardware, you do the guesswork, and you think this hardware, with this number of cores and this much memory, is big enough but not too big. But, sometimes you don't have enough memory, or you don't have enough storage, or you may need to add some cores. You never know.

Emma: With the cloud, you can just test everything, right? You can always change the configuration, and you can always add new hardware. Yeah, so you don't need to choose and invest at the beginning. You just change and then adapt along the way.

Brian: This is a fascinating angle. People often say if you're doing a batch workload, that may not be the sweet spot. But, there's this upfront testing side of it, where you're being more scientific, and you're trying to figure out what kind of computer, what kind of arrangements of the computers. You can do all of those tests in really short durations, which makes them affordable to do a wide variety of them on cloud. Thank you. That's it. I'm in the middle of this and that's an angle on this I hadn't been thinking of. So, I appreciate that.

Carter: Yeah. Yo, Emma, thank you so much for coming in and sharing what you've been working on. The way you're able to push the Cloud Compute instances to the extreme. There's a lot that I didn't think was possible or didn't know was possible. I hate to admit this, Brian, but the more we talk, the more I start thinking cloud computing makes a lot of sense.

Brian: Imagine.

Carter: Well, yo, that's it for this episode. Emma, thank you. If you're watching at home, definitely let us know of anything surprising to you. We check out all the comments. Stay tuned for the next episode because it's a special one. I don't know what's in store, but Brian said I couldn't miss it.

Brian: Oh, we do. It's going to be awesome. Yeah, so please hang in there, check it out. If you're living in the future already, just hit next on the playlist and join us. It'll be fun!

Special thanks to Emma Haruka Iwao, Developer Advocate at Google, for being this episode's guest!