Rethinking your VM strategy with Spot VMs
Enterprise SI Partner Engineering Team
Organizations globally choose Google Cloud as their transformation partner to accelerate their business and digital transformation because of our leadership in sustainability, AI/ML, data analytics, and more. We are also committed to making cost optimization simple, so we offer a suite of services and tools for customers to effortlessly optimize their environments.
“Media.net chose Spot VMs after exploring various options to support spiky workloads, as they provided Media.net with both deep discounts and simple, predictable pricing.” — Amit Bhawani, Sr VP of Engineering, Media.net
Today, we’ll dive deeper into the use cases and best practices for provisioning and managing Spot VMs to help you save up to 91% on your compute costs.
Spot VMs, previously known as preemptible VMs, are ideal for fault-tolerant workloads and offer the same performance as on-demand VMs. Spot VMs are priced 60% to 91% below on-demand VMs, and the discount also applies to GPUs, local SSDs, and IP addresses that are attached to the VM. Prices vary by region and machine type.
Let's now look at top use cases and workloads that work well with Spot VMs.
Spot VMs are great for batch computing, HPC workloads, training ML models, and stateless web applications. Containerized workloads that can handle instance failure/termination are a great fit too. Spot is integrated with Google Kubernetes Engine (GKE), GKE Autopilot, Batch, Dataproc, and Dataflow VMs.
Because Spot VMs can be preempted (or interrupted), it is recommended to use Spot for fault-tolerant workloads such as rendering, genomic processing, and financial modeling.
Conversely, workloads with high uptime needs, such as stateful and fault-intolerant workloads, are not a great fit. For a deep dive into the use cases and best practices of using Spot VMs, please check out our blog.
Simplified and predictable pricing: Spot VMs offer a minimum of 60% and up to 91% off compute costs, with predictable pricing that changes at most once a month, allowing you to better forecast costs and avoid runaway costs. To see prices, you can look them up manually on the VM instance pricing page or query them using the Cloud Billing Catalog API.
No time limits: Spot VMs run indefinitely until Compute Engine needs to reclaim resources.
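To make the discount range concrete, here is a minimal sketch of the arithmetic. The on-demand rate of $0.10 per hour and the 730-hour month are hypothetical numbers chosen only for illustration, not real prices; actual Spot prices vary by region and machine type.

```python
def spot_cost_range(on_demand_hourly: float, hours: float) -> tuple[float, float]:
    """Return the (lowest, highest) Spot cost for a 60% to 91% discount."""
    min_discount, max_discount = 0.60, 0.91  # range quoted in this post
    on_demand_total = on_demand_hourly * hours
    lowest = on_demand_total * (1 - max_discount)   # best case: 91% off
    highest = on_demand_total * (1 - min_discount)  # worst case: 60% off
    return round(lowest, 2), round(highest, 2)


# Hypothetical on-demand rate of $0.10/hour over a 730-hour month.
low, high = spot_cost_range(0.10, 730)
print(f"On-demand: ${0.10 * 730:.2f}, Spot: ${low:.2f} to ${high:.2f}")
```

Because Spot prices change at most once a month, an estimate like this can be refreshed monthly with the current rates.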
Spot deployment overview
You can deploy one or more managed instance groups (MIGs) to support each pool of Spot VM resources you want to scale and manage. This is ideal for workloads that don’t require a minimum set of resources to run.
In contrast, we offer a fully managed batch service that integrates with Spot, called Google Batch. There is no additional cost for using Batch, and it lets you create and run jobs that automatically provision and use the resources required to execute their tasks. Let's now look at the different methods of creating and managing Spot VMs.
Maintain and automate your Spot VMs with Managed Instance Groups:
Managed instance groups (MIGs) help ensure that a group of VMs can meet the demands of your application and customers. Like other managed services and features, MIGs let the cloud step in and take some actions automatically, reducing the manual work and management burden on your team. MIGs handle rolling updates and blue/green deployments, and they can scale out or in automatically based on a configurable metric. When used with Spot VMs, a MIG provides the same benefits, deploying Spot VMs when it scales out or replaces VMs lost to preemption. If Spot VMs are not available, the MIG keeps requesting the additional Spot VMs until capacity becomes available and the request is filled. Please note that MIGs will not prevent an outage if all of the Spot VMs are preempted; however, when Spot VMs become available again, the MIG will bring new instances online without manual work.
Create and use Spot VMs
Now that we have a better understanding of Spot VMs and their respective use cases, let’s walk through how to create and manage them, including the following:
How to start and identify Spot VMs
Various ways to create Spot VMs
Spot VMs with Google Kubernetes Engine (GKE)
Best practices for Spot VMs
Like other VMs, Spot VMs require available quota, such as CPU quota. If you have not requested preemptible quota for the resources your Spot VMs use, the Spot VMs consume your standard quota. If you plan to use Spot VMs, consider requesting preemptible quota for those resources as a first step, so that Spot VMs do not draw down your standard quota.
Spot VMs can be created in a number of ways by using the console, gcloud CLI, the Compute Engine API, or Terraform. A Spot VM is any VM that is configured to use the spot provisioning model.
In the Google Cloud console, go to the Create an instance page.
Expand the Networking, disks, security, management, sole tenancy section, and do the following:
Expand the Management section.
In the Availability policies section, select Spot from the VM provisioning model list. This setting disables automatic restart and host maintenance options for the VM and enables the termination action option.
Optional: In the On VM termination list, select what happens when Compute Engine preempts the VM:
To stop the VM during preemption, select Stop (default).
To delete the VM during preemption, select Delete.
Optional: Specify other VM options. For more information, see Creating and starting a VM instance.
To create and start the VM, click Create.
To create a VM from the gcloud CLI, use the gcloud compute instances create command. To create Spot VMs, you must include the --provisioning-model=SPOT flag. Optionally, you can also specify a termination action for Spot VMs by including the --instance-termination-action flag:
gcloud compute instances create VM_NAME --provisioning-model=SPOT --instance-termination-action=TERMINATION_ACTION
Replace the following:
VM_NAME: name of the new VM.
TERMINATION_ACTION: Optional: specify which action to take when Compute Engine preempts the VM, either STOP (default behavior) or DELETE.
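If you create Spot VMs from automation rather than by hand, the flags described above can be assembled programmatically. The following is a minimal sketch, not an official tool: the helper name spot_create_command and the VM name example-spot-vm are hypothetical, and the script only prints the command instead of running it.

```python
import shlex


def spot_create_command(vm_name: str, termination_action: str = "STOP") -> list[str]:
    """Build the gcloud argument list for creating a Spot VM."""
    if termination_action not in ("STOP", "DELETE"):
        raise ValueError("termination_action must be STOP or DELETE")
    return [
        "gcloud", "compute", "instances", "create", vm_name,
        "--provisioning-model=SPOT",
        f"--instance-termination-action={termination_action}",
    ]


# Print the command for review; pass the list to subprocess.run() to execute it.
print(shlex.join(spot_create_command("example-spot-vm", "DELETE")))
```

Building the command as a list, rather than as a single string, avoids shell-quoting problems when VM names come from user input or configuration files.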
To create multiple Spot VMs with the same properties, you can create an instance template, and use the template to create a managed instance group (MIG).
Compute Engine API
To create a VM from the Compute Engine API, use the instances.insert method. You must specify a machine type and name for the VM. Optionally, you can also specify an image for the boot disk.
To create Spot VMs, you must include the "provisioningModel": "SPOT" field in the scheduling object of the request body. Optionally, you can also specify a termination action for Spot VMs by including the "instanceTerminationAction" field in the same object.
For more information about the options you can specify when creating a VM, see Creating and starting a VM instance.
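As a rough illustration of the request body described above, the sketch below builds the JSON that would be sent in a POST to the instances.insert endpoint (https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instances). The helper name spot_instance_body and all example values (VM name, zone, machine type, image, default network) are assumptions for illustration; authentication and the HTTP call itself are left out.

```python
import json


def spot_instance_body(name: str, zone: str, machine_type: str,
                       source_image: str, termination_action: str = "STOP") -> dict:
    """Build an instances.insert request body that asks for a Spot VM."""
    return {
        "name": name,
        "machineType": f"zones/{zone}/machineTypes/{machine_type}",
        "disks": [{
            "boot": True,
            "initializeParams": {"sourceImage": source_image},
        }],
        "networkInterfaces": [{"network": "global/networks/default"}],
        # The scheduling object is what makes this a Spot VM.
        "scheduling": {
            "provisioningModel": "SPOT",
            "instanceTerminationAction": termination_action,
        },
    }


# Example values below are placeholders, not recommendations.
body = spot_instance_body(
    name="example-spot-vm",
    zone="us-central1-a",
    machine_type="n2-standard-16",
    source_image="projects/debian-cloud/global/images/family/debian-12",
)
print(json.dumps(body, indent=2))
```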
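With Terraform, you can create a Spot VM by adding a scheduling block to a google_compute_instance resource and setting provisioning_model to "SPOT" in that block, together with preemptible set to true and automatic_restart set to false.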
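Terraform also accepts configuration in its JSON syntax (files ending in .tf.json), so a Spot VM definition can be generated from a script. The sketch below is one possible shape, assuming the google provider's google_compute_instance resource; the resource label spot_vm and the example name, zone, machine type, and image are hypothetical.

```python
import json

# Scheduling settings that mark the instance as a Spot VM.
spot_scheduling = {
    "provisioning_model": "SPOT",
    "preemptible": True,
    "automatic_restart": False,
    "instance_termination_action": "STOP",
}

config = {
    "resource": {
        "google_compute_instance": {
            "spot_vm": {  # hypothetical resource label
                "name": "example-spot-vm",
                "machine_type": "n2-standard-16",
                "zone": "us-central1-a",
                "boot_disk": {"initialize_params": {"image": "debian-cloud/debian-12"}},
                "network_interface": {"network": "default"},
                "scheduling": spot_scheduling,
            }
        }
    }
}


def render_terraform_json(cfg: dict) -> str:
    """Serialize the configuration in Terraform's JSON syntax."""
    return json.dumps(cfg, indent=2)


# Save the printed text as main.tf.json to use it with terraform plan.
print(render_terraform_json(config))
```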
Spot VMs with Google Kubernetes Engine (GKE) and Autopilot Clusters
When you create a cluster or node pool with Spot VMs, GKE creates underlying Compute Engine Spot VMs that behave like a managed instance group (MIG). Nodes that use Spot VMs behave like standard GKE nodes but with no guarantee of availability. When the resources used by Spot VMs are required to run standard VMs, Compute Engine terminates those Spot VMs to use the resources elsewhere. This section shows you how to run fault-tolerant, stateless, or batch workloads at lower costs by using Spot VMs and Spot Pods in your GKE clusters and node pools.
Before you begin, ensure the Google Kubernetes Engine API is enabled.
Instructions to create a cluster or node pool with Spot VMs can be found in the published GKE documentation.
Instructions to create Spot Pods for GKE Autopilot clusters can also be found in the published GKE documentation.
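To steer a workload onto Spot capacity, a Pod can select nodes carrying the cloud.google.com/gke-spot=true label, which GKE applies to Spot VM nodes; on Autopilot, the same node selector requests Spot Pods. The following sketch generates such a Pod manifest as JSON, which kubectl apply -f accepts. The helper name spot_pod_manifest, the Pod name, and the container image are placeholders for illustration.

```python
import json

SPOT_NODE_LABEL = "cloud.google.com/gke-spot"


def spot_pod_manifest(name: str, image: str) -> dict:
    """Build a Pod manifest that is scheduled only onto Spot nodes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Label values are strings in Kubernetes, hence "true" and not True.
            "nodeSelector": {SPOT_NODE_LABEL: "true"},
            "containers": [{"name": name, "image": image}],
        },
    }


# Placeholder Pod name and image; replace with your own workload.
print(json.dumps(spot_pod_manifest("batch-worker", "busybox"), indent=2))
```

In practice you would usually apply the same nodeSelector to a Deployment or Job template so that Pods lost to preemption are recreated automatically.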
Start and stop Spot VMs
Like other VMs, Spot VMs start upon creation. Likewise, if Spot VMs are stopped, you can restart the VMs to resume the RUNNING state. You can stop and restart preempted Spot VMs as many times as you would like, as long as there is capacity. For more information, see VM instance life cycle.
If Compute Engine stops one or more Spot VMs in an autoscaling managed instance group (MIG) or Google Kubernetes Engine (GKE) cluster, the group restarts the VMs when the resources become available again.
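To identify which of your VMs are Spot VMs, you can inspect the scheduling.provisioningModel field of the instance resource, for example in the JSON returned by gcloud compute instances describe with --format=json. A minimal sketch, using hand-written sample data in place of a real API response:

```python
def is_spot_vm(instance: dict) -> bool:
    """Return True if an instance resource uses the Spot provisioning model."""
    return instance.get("scheduling", {}).get("provisioningModel") == "SPOT"


# Hypothetical, trimmed-down instance resources for illustration.
instances = [
    {"name": "batch-worker-1", "scheduling": {"provisioningModel": "SPOT"}},
    {"name": "frontend-1", "scheduling": {"provisioningModel": "STANDARD"}},
]

spot_names = [vm["name"] for vm in instances if is_spot_vm(vm)]
print(spot_names)
```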
Here are some best practices to help you get the most out of Spot VMs.
Use instance templates. Rather than creating Spot VMs one at a time, you can use instance templates to create multiple Spot VMs with the same properties. Instance templates are required for using MIGs. Alternatively, you can also create multiple Spot VMs using the bulk instance API.
Use MIGs to regionally distribute and automatically recreate Spot VMs. Use MIGs to make workloads on Spot VMs more flexible and resilient. For example, use regional MIGs to distribute VMs across multiple zones, which helps mitigate resource-availability errors. Additionally, use autohealing to automatically recreate Spot VMs after they are preempted.
Pick smaller machine types. Resources for Spot VMs come out of excess and backup Google Cloud capacity. Capacity for Spot VMs is often easier to get for smaller machine types, meaning machine types with fewer resources such as vCPUs and memory. You might find more capacity for Spot VMs by selecting a smaller custom machine type, but capacity is even more likely for smaller predefined machine types. For example, compared to capacity for the n2-standard-32 predefined machine type, capacity for the n2-custom-24-96 custom machine type is more likely, but capacity for the n2-standard-16 predefined machine type is even more likely. Note that non-compute services such as persistent disk and networking are not currently eligible for Spot VM discounts.
Run large clusters of Spot VMs during off-peak times. The load on Google Cloud data centers varies with location and time of day, but is generally lowest on nights and weekends. As such, nights and weekends are the best times to run large clusters of Spot VMs.
Design your applications to be fault and preemption tolerant. Preemption patterns change over time, and your application should be prepared for that. For example, if a zone suffers a partial outage, large numbers of Spot VMs could be preempted to make room for standard VMs that need to be moved as part of the recovery. In that small window of time, the preemption rate would look very different than on any other day. If your application assumes that preemptions are always done in small groups, you might not be prepared for such an event. You can test your application's behavior under a preemption event by stopping the VM.
Use shutdown scripts. Manage shutdown and preemption notices with a shutdown script that can save a job's progress so that it can pick up where it left off, rather than start over from scratch.
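The checkpointing idea behind shutdown scripts can be sketched as follows. This is an illustrative pattern rather than a Compute Engine API: it assumes the job process receives SIGTERM when the VM shuts down, and the checkpoint path and function names are hypothetical. A real job would write its checkpoint to durable storage, such as a Cloud Storage bucket or a persistent disk, instead of a local temporary directory.

```python
import json
import os
import signal
import sys
import tempfile

# Hypothetical checkpoint location; use durable storage in practice.
CHECKPOINT_PATH = os.path.join(tempfile.gettempdir(), "spot_job_checkpoint.json")


def save_checkpoint(next_item: int, path: str = CHECKPOINT_PATH) -> None:
    """Atomically record the index of the next unprocessed item."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"next_item": next_item}, f)
    os.replace(tmp_path, path)  # rename so readers never see a partial file


def load_checkpoint(path: str = CHECKPOINT_PATH) -> int:
    """Return the saved position, or 0 when starting from scratch."""
    try:
        with open(path) as f:
            return json.load(f)["next_item"]
    except FileNotFoundError:
        return 0


def run_job(total_items: int, path: str = CHECKPOINT_PATH) -> None:
    """Process items in order, resuming from the last checkpoint."""
    progress = {"next_item": load_checkpoint(path)}

    def on_sigterm(signum, frame):
        # Save progress during the shutdown window, then exit cleanly.
        save_checkpoint(progress["next_item"], path)
        sys.exit(0)

    signal.signal(signal.SIGTERM, on_sigterm)
    for i in range(progress["next_item"], total_items):
        # ... process item i here ...
        progress["next_item"] = i + 1
    save_checkpoint(progress["next_item"], path)


if __name__ == "__main__":
    run_job(total_items=5)
```

When the VM is recreated or restarted, running the same script resumes from the saved position rather than from item 0.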
To learn more about Spot VMs, please check out the Spot VM documentation.