Top 5 use cases for Google Cloud Spot VMs explained + best practices

June 22, 2022

Paul Brouwers

Customer Engineer, Google Cloud

Stefan Salandy

Customer Engineer, Google Cloud

Cloud was built on the premise of flexible infrastructure that grows and shrinks with your application demands. Applications that scale horizontally on this elastic infrastructure give you a significant advantage over competitors, because infrastructure costs rise and fall along with demand.

Google Cloud’s Spot VMs let customers make the most of our idle capacity where and when it is available. Spot VMs are offered at a significant discount from list price, provided customers have flexible, stateless workloads that can handle preemption: Google can reclaim a Spot VM with a 30-second notice. When you deploy the right workloads on Spot VMs, you maintain elasticity while also taking advantage of the best discounts Google has to offer.

This post covers a few common use cases and design patterns we have seen customers use Spot VMs for, along with best practices for each. While the list is not exhaustive, it serves as a template to help customers make the most of Spot VM savings while still reaching their application and workload objectives.

Media rendering

Rendering workloads (such as rendering 2D or 3D elements) can be both compute- and time-intensive, and managing a render farm requires skilled IT resources. Job management becomes even more difficult when the render farm is at 100% utilization. Spot VMs are ideal resources for fault-tolerant rendering workloads; when combined with a queuing system, customers can use the preemption notice to track preempted jobs. This allows you to build a render farm that benefits from reduced TCO. If your renderer supports taking snapshots of in-progress renders at specified intervals, writing these snapshots to a persistent data store (Cloud Storage) limits the work lost if the Spot VM is preempted. As subsequent Spot VMs are created, they can pick up where the old ones left off by using the snapshots on Cloud Storage. You can also leverage the new “suspend and resume a VM” feature, which lets you keep VM instances through a preemption event without incurring charges for them while they are not in use.
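The snapshot-and-resume pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a renderer integration: a local directory stands in for a Cloud Storage bucket, and the `render_frame` callback, the snapshot file naming, and the JSON layout are all hypothetical.

```python
import json
from pathlib import Path


def latest_snapshot(snapshot_dir: Path) -> dict:
    """Return the most recent snapshot, or a fresh state if none exists."""
    snapshots = sorted(snapshot_dir.glob("snapshot-*.json"))
    if not snapshots:
        return {"next_frame": 0}
    return json.loads(snapshots[-1].read_text())


def save_snapshot(snapshot_dir: Path, next_frame: int) -> None:
    """Persist progress; zero-padded names keep lexical order equal to frame order."""
    path = snapshot_dir / f"snapshot-{next_frame:08d}.json"
    path.write_text(json.dumps({"next_frame": next_frame}))


def render_job(snapshot_dir: Path, total_frames: int, render_frame, snapshot_every: int = 10) -> None:
    """Render frames starting from the last snapshot, saving progress at intervals."""
    state = latest_snapshot(snapshot_dir)
    for frame in range(state["next_frame"], total_frames):
        render_frame(frame)  # hypothetical callback into the renderer
        if (frame + 1) % snapshot_every == 0:
            save_snapshot(snapshot_dir, frame + 1)
    save_snapshot(snapshot_dir, total_frames)
```

If a VM is preempted partway through, a replacement VM calling `render_job` with the same snapshot location re-renders at most `snapshot_every - 1` frames before continuing with new work.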

Additionally, we have helped customers combine local render farms in their existing datacenters with cloud-based render farms, allowing a hybrid approach for large or numerous render workloads without increasing their investment in their physical datacenters. Not only does this reduce their capital expenses, but it adds flexible scalability to the existing farm and provides a better experience for their business partners.

Financial modeling

Capital market firms have significant investments in their infrastructure to create state-of-the-art, world-class compute grids. Since compute grids began, in-house researchers have leveraged these large grids in physical datacenters to test their trading hypotheses and perform backtesting. But as the business grows, what happens when all the researchers have a brilliant idea and want to test it at the same time? Researchers then have to compete with one another for the same limited resources, which leads to queued jobs and longer lead times for testing their ideas. And in financial markets, time is always scarce.

Enter cloud computing and Spot VMs. Capital market firms can use Google Cloud as an extension of their on-premises grid by spinning up temporary compute resources. Or they can go all in on cloud and build their grid entirely in Google Cloud. In either scenario, Spot VMs are ideal candidates for bursting research workloads, given the transient nature of the workload and the heavily discounted prices of the VMs. This enables researchers to test more hypotheses at a lower cost per test, in turn producing better models for firms.

Google Cloud Spot VM discounts apply not only to the VMs themselves, but also to any GPU accelerator attached to them, providing even more processing power to a firm looking to process larger, more complex models. Once these jobs have completed, Spot VMs can be quickly spun down, maintaining strict control on costs.

CI/CD pipelines

Continuous integration (CI) and continuous delivery (CD) tools are very common for the modern application developer. These tools let developers and quality engineers build a testing pipeline that verifies newly created code works in their environment and that the deployment process does not break anything. CI/CD tools and test environments are great workloads to run on Spot VMs, since CI/CD pipelines are not mission-critical for most companies: a delay in deployment or testing of 15 minutes, or even a few hours, is not material to their business. This means that companies can lower the cost of operating their CI/CD pipeline significantly through the use of Spot VMs.

A simple example of this would be to install the Jenkins master server in a Managed Instance Group (MIG) with the VM type set to Spot. If the VM gets preempted, the CI/CD pipelines will stall until the MIG can find resources again to spin up a new VM. The first reaction may be concern that Jenkins persists data locally, which is problematic for Spot VMs. However, customers can move the Jenkins directory (/var/lib/jenkins) to Filestore and preserve this data. Then, when the new Spot VM spins up, it will reconnect to the directory. In a large-scale Jenkins deployment, build VMs can run as Spot VMs in a MIG that scales as necessary, while on-demand VMs ensure that builds can always be maintained. This blended approach removes any risk to the builds, while still allowing customers to save up to 91% on the cost of the additional VMs versus traditional on-demand VMs.

Web services and apps

Large online retailers have found ways to drive massive increases in order volume. Typically, these companies target a specific time each month, such as the last day of the month, through a unique promotion process. This means that in many cases they are creating a Black Friday/Cyber Monday-style event each and every month! To support this, companies traditionally used a “build it like a stadium for Super Bowl Sunday” model. The issue with that model, and a reason most professional sports teams have practice facilities, is that it’s very expensive to keep all the lights, climate control, and ancillary equipment running for the sole purpose of practice. For 29 to 30 days of the month, most of the infrastructure sits idle, wasting HVAC, electricity, and so on. With the elasticity of cloud, we can manage this capacity and turn it up only when necessary. And to drive even more optimization and savings, we turn to Spot VMs.

Spot VMs really shine during these kinds of scale-out events. Take the scenario above. Behind a load balancer, we could have:

One MIG to help scale the web frontends. This MIG will be sized with on-demand VMs to handle day-to-day traffic.
A second MIG of Spot VMs that scales up starting at 11:45 p.m. on the night before the last day of the month. Together, the first and second MIGs can handle ~80-90% of the workload.
A third MIG of on-demand VMs that spins up as the workload bursts to handle any remaining traffic, should the Spot MIG not be able to find enough capacity. This ensures we meet our SLAs while keeping costs as tight as possible.
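The three-MIG layout above amounts to a simple allocation order: fill the on-demand baseline first, then Spot, then burst to on-demand for whatever is left. The sketch below only illustrates that arithmetic. In practice the MIG autoscalers and the load balancer make these decisions; the function name, parameters, and VM counts here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityPlan:
    baseline_on_demand: int  # first MIG: day-to-day traffic
    spot: int                # second MIG: discounted scale-out capacity
    burst_on_demand: int     # third MIG: covers any Spot shortfall


def plan_capacity(required_vms: int, baseline_vms: int,
                  spot_target: int, spot_available: int) -> CapacityPlan:
    """Split the required VM count across the three MIGs in priority order."""
    if min(required_vms, baseline_vms, spot_target, spot_available) < 0:
        raise ValueError("all inputs must be non-negative")
    baseline = min(required_vms, baseline_vms)
    remaining = required_vms - baseline
    # Spot is limited both by what we ask for and by what capacity exists.
    spot = min(remaining, spot_target, spot_available)
    burst = remaining - spot
    return CapacityPlan(baseline, spot, burst)
```

For example, with 100 VMs required, a baseline of 20, and a Spot target of 70, the plan is 20 baseline, 70 Spot, and 10 burst when Spot capacity is plentiful; if only 40 Spot VMs can be obtained, the burst MIG grows to 40 to cover the gap.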

Kubernetes

Now you may say “Well that’s all well and good, but we’re a fully modernized container shop, using Google Kubernetes Engine (GKE).” You are in luck: Spot VMs are integrated with GKE, so you can quickly and easily save on your GKE workloads by using Spot VMs with standard GKE clusters or Spot Pods with your Autopilot clusters. GKE supports gracefully shutting down Spot VMs, notifying your workloads that they will be shut down and giving them time to exit cleanly. GKE then automatically reschedules your deployments. With Spot Pods, you can use Kubernetes nodeSelectors and/or node affinity to control the placement of Spot workloads, striking the right balance between cost and availability across Spot and on-demand compute.
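As a rough sketch of the nodeSelector approach, the snippet below builds a Deployment manifest in Python and prints it as JSON, which `kubectl apply -f` accepts. It assumes the `cloud.google.com/gke-spot` node label that GKE uses to identify Spot capacity; the workload name, image, and the 15-second grace period are hypothetical values, and the grace period GKE actually honors on preemption should be checked against the current GKE documentation.

```python
import json


def spot_deployment(name: str, image: str, replicas: int, grace_seconds: int = 15) -> dict:
    """Build a Deployment manifest whose Pods are placed on Spot capacity."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Only schedule onto nodes labeled as Spot.
                    "nodeSelector": {"cloud.google.com/gke-spot": "true"},
                    # Assumed value: time the workload gets to exit cleanly.
                    "terminationGracePeriodSeconds": grace_seconds,
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical workload name and image.
    manifest = spot_deployment("batch-worker", "example.com/batch-worker:1.0", replicas=3)
    print(json.dumps(manifest, indent=2))
```

A nodeSelector is a hard requirement. Workloads that should prefer Spot but fall back to on-demand nodes would use a preferred node affinity rule on the same label instead.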

General best practices

To take advantage of Spot VMs, your use case doesn’t have to be an exact match to any of those described above. If the workload is stateless, scalable, can be stopped and checkpointed in less than 30 seconds, or is location- and hardware-flexible, then it may be a good fit for Spot VMs.

There are several actions you can take to help ensure your Spot workloads run as smoothly as possible. Below we outline a few best practices you should consider:

1. Deploy Spot behind Regional Managed Instance Groups (RMIGs):

RMIGs are a great fit for Spot workloads given the RMIG’s ability to recreate instances that are preempted.
Using your workload’s profile, determine the RMIG’s target distribution shape. For example, with a batch research workload, you might select an ANY target distribution shape. This allows Spot instances to be distributed in any manner across the various zones, thereby taking advantage of any underutilized resources.
You can use a mix of on-demand RMIGs and Spot RMIGs to maintain stateful applications while increasing availability in a cost-effective manner.
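The settings involved can be sketched as plain request bodies. The field names below (`provisioningModel`, `instanceTerminationAction`, `distributionPolicy.targetShape`) follow the Compute Engine REST API as we understand it and should be verified against the current API reference before use; the group name, template URL, and size are hypothetical.

```python
TARGET_SHAPES = ("EVEN", "BALANCED", "ANY", "ANY_SINGLE_ZONE")


def spot_scheduling(termination_action: str = "DELETE") -> dict:
    """Scheduling block for an instance template that provisions Spot VMs."""
    if termination_action not in ("STOP", "DELETE"):
        raise ValueError("termination_action must be STOP or DELETE")
    return {
        "provisioningModel": "SPOT",
        "instanceTerminationAction": termination_action,
        # Spot VMs cannot live-migrate or restart automatically.
        "automaticRestart": False,
        "onHostMaintenance": "TERMINATE",
    }


def regional_mig_body(name: str, template_url: str, target_size: int,
                      target_shape: str = "ANY") -> dict:
    """Request body for creating a regional MIG with a target distribution shape."""
    if target_shape not in TARGET_SHAPES:
        raise ValueError(f"target_shape must be one of {TARGET_SHAPES}")
    return {
        "name": name,
        "baseInstanceName": name,
        "instanceTemplate": template_url,
        "targetSize": target_size,
        "distributionPolicy": {"targetShape": target_shape},
    }


if __name__ == "__main__":
    # Hypothetical project and template names.
    template = "projects/my-project/global/instanceTemplates/spot-research"
    print(spot_scheduling())
    print(regional_mig_body("research-grid", template, target_size=50))
```

The same bodies can be passed to the Compute Engine client library or expressed in infrastructure-as-code tooling; building them as dictionaries simply keeps the Spot-specific fields visible in one place.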

2. Ensure you have a shutdown script:

When a Spot VM is preempted, use a shutdown script to checkpoint your workloads to Cloud Storage and to perform any graceful shutdown steps.
When drafting your shutdown script, test it on an instance that has the script attached by manually stopping or deleting the instance, then validate that it behaved as intended.
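A shutdown script has to finish inside the preemption notice, so it helps to run the most important step first and stop when the time budget is spent. The sketch below illustrates that idea under stated assumptions: the step names and state contents are hypothetical, and the checkpoint is written to a local path that stands in for a Cloud Storage destination (for example, a mounted bucket or a follow-up upload).

```python
import json
import os
import time
from pathlib import Path

PREEMPTION_BUDGET_SECONDS = 30  # notice period described in this post


def write_checkpoint(state: dict, path: Path) -> None:
    """Write atomically so an interrupted write never leaves a truncated file."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def run_shutdown(steps, budget_seconds=PREEMPTION_BUDGET_SECONDS, clock=time.monotonic):
    """Run (name, callable) steps in order; skip steps once the budget is spent.

    Returns the names of the steps that ran. A step that is already running
    is not interrupted, so each step should itself be short.
    """
    deadline = clock() + budget_seconds
    completed = []
    for name, step in steps:
        if clock() >= deadline:
            break
        step()
        completed.append(name)
    return completed
```

Ordering the steps by importance (checkpoint first, optional cleanup last) means that a slow shutdown loses only the least valuable work.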

3. Write checkpoint files to Cloud Storage.

4. Consider using multiple MIGs behind your load balancer.

Whether your workload is graphics rendering, financial modeling, scaled-out ecommerce, or any other stateless use case, Spot VMs are the best and easiest way to reduce your cost of operating it by more than 60%. By following the examples and best practices above, you can ensure that Spot VMs will create the right outcome. Get started today with a free trial of Google Cloud.

Acknowledgement

Special thanks to Dan Sheppard, Product Manager for Cloud Compute, for contributing to this post.
