Cost optimization using automated VM management

This document describes ways to automatically manage virtual machine (VM) instances on Google Cloud, and it recommends best practices for different use cases. This document is intended for operations personnel and administrators who are responsible for maintenance and cost control of existing cloud infrastructure, and for developers who are migrating existing workflows into the cloud.

A primary benefit of the cloud is that you only pay for the compute resources you use. For example, Compute Engine VMs are charged per second. Production systems commonly run constantly, but some VMs—like those in development or test environments—are typically used only during business hours. Other VMs that are used for batch processing aren't needed after they complete their tasks. Keeping VMs running when they have no users or tasks to run serves no useful purpose, so turning them off saves money. Systems that are migrated from on-premises hardware don't usually take advantage of the ability to turn off when they don't have users or tasks to run. Managing fleets of VMs manually is tedious, error-prone, and hard to enforce across a large organization.

Google Cloud offers many ways to optimize and automatically manage VM instances, from simple time-based instance schedules to tools like Cloud Composer that can orchestrate complex workflows across many products and clouds. This document introduces options to help you decide which workflow is best for your use case.

Choose an automation workflow

The following diagram provides a decision tree that can help you to identify the most suitable automation workflow for your use case.

Figure: A decision tree that helps you pick an automated workflow.

The preceding diagram outlines the following steps:

  1. Are the VMs used by people, such as for remote workstations, or for batch or event jobs?
    1. For people, see Automate instance scheduling later in this document. Automated instance scheduling uses Cloud Scheduler and Cloud Functions.
    2. For batch or event jobs, proceed to the next step.
  2. Do your jobs involve orchestrating work across many products or clouds?
    1. If yes, see Orchestrating complex workflows later in this document. Complex workflows use Workflows and Cloud Composer.
    2. If no, proceed to the next step.
  3. Are your jobs too large to fit within one VM instance?
    1. If yes, see Processing large volumes of data later in this document. To process large volumes of data, you use Dataflow.
    2. If no, see Running jobs on demand later in this document. To run jobs on demand, you use Cloud Scheduler and Cloud Functions.
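
The decision steps above can also be written as a small function. This is only an illustrative sketch: the function name and parameter names are hypothetical, and the return values are the section titles used in this document.

```python
def choose_workflow(used_by_people: bool,
                    spans_products_or_clouds: bool = False,
                    too_large_for_one_vm: bool = False) -> str:
    """Return the section of this document that matches the answers.

    The questions are asked in the same order as the decision tree.
    """
    if used_by_people:
        return "Automate instance scheduling"
    if spans_products_or_clouds:
        return "Orchestrating complex workflows"
    if too_large_for_one_vm:
        return "Processing large volumes of data"
    return "Running jobs on demand"


if __name__ == "__main__":
    # A nightly report job that fits on one VM and uses a single product.
    print(choose_workflow(used_by_people=False))
```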

Automate instance scheduling

Scheduling is useful for VM instances that are only required during certain times of the day or week. Common use cases include virtual desktops and development or testing infrastructure that is used only during working hours.

Before you set up scheduling, consider the following to determine if scheduling is the best option for your organization and its users:

  • Compute Engine offers committed use discounts, which can provide substantial cost savings for most resources. If your organization already has a committed use contract that's large enough to cover all your VM usage, scheduling doesn't provide any further cost reduction.
  • Compute Engine offers automatic sustained use discounts for eligible instances that run for a significant portion of a billing month. The discount grows with usage and reaches its maximum, up to 30%, only when the instance runs for the entire month. Although scheduling can still reduce costs, the cost advantage is reduced accordingly.
  • Automated scheduling requires predictable hours. If your users' working hours are unpredictable, consider using an automated cleanup solution that can turn off instances after a certain period instead.
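
To see how sustained use discounts shrink the benefit of scheduling, the following sketch compares an always-on instance with one that runs only during working hours. The tier rates are an assumption for illustration (each quarter of the month billed at 100%, 80%, 60%, and 40% of the base rate, which nets out to a 30% discount for a full month); check the current pricing for your machine type before relying on them. The function name and the 50-hour week are also hypothetical.

```python
# Assumed incremental rates for each quarter of the billing month.
# These are illustrative; actual tiers depend on the machine type.
SUD_TIER_RATES = (1.0, 0.8, 0.6, 0.4)


def effective_cost_fraction(usage_fraction: float) -> float:
    """Cost as a fraction of the undiscounted full-month price."""
    if not 0.0 <= usage_fraction <= 1.0:
        raise ValueError("usage_fraction must be between 0 and 1")
    tier_width = 1.0 / len(SUD_TIER_RATES)
    cost = 0.0
    for index, rate in enumerate(SUD_TIER_RATES):
        tier_start = index * tier_width
        # Portion of the month's usage that falls inside this tier.
        used_in_tier = min(max(usage_fraction - tier_start, 0.0), tier_width)
        cost += used_in_tier * rate
    return cost


if __name__ == "__main__":
    always_on = effective_cost_fraction(1.0)
    scheduled = effective_cost_fraction(50 / 168)  # 50 hours per week
    print(f"always on: {always_on:.3f} of list price")
    print(f"scheduled: {scheduled:.3f} of list price")
    print(f"savings from scheduling: {1 - scheduled / always_on:.1%}")
```

Under these assumed tiers, running 50 hours a week cuts usage by roughly 70%, but the bill drops by only about 59%, because the always-on instance was already receiving the full discount.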

Stopping, suspending, and deleting instances

In Google Cloud, a fully operational instance is in the running state. You can effectively turn off a running instance by suspending, stopping, or deleting it. The following table shows how each state affects billing and which instance components are retained:

Instance state   vCPU   Memory   Disk   Billing
Running          Yes    Yes      Yes    vCPU, memory, and disk
Suspended        No     Yes      Yes    Instance memory and disk
Stopped          No     No       Yes    Disk only
Deleted          No     No       No     None

Suspending an instance

When an instance is suspended, it is no longer available for use, but it retains both memory and disk state. You aren't billed for vCPUs, but you are billed for any suspended instance memory and device state and any attached disks.

Consider suspending instances if it is important that users can quickly resume their work from where they ended the previous day. Common use cases for suspension include virtual desktops and developer workstations. For more information, see Suspending and resuming an instance.

Stopping an instance

When an instance is stopped, attached disks are retained, but any in-memory contents are lost. While an instance is stopped, you are only billed for its disks. Starting a stopped instance is equivalent to rebooting it from the attached disk, including running any startup scripts.

Consider stopping instances if retaining memory state and resuming rapidly aren't requirements. Common use cases for stopping include test environments and CI/CD pipelines. For information about setting up a schedule, see Scheduling a VM instance to start and stop.

Deleting an instance

An instance that is deleted no longer exists and cannot be restored. By default, attached boot disks are also deleted automatically, although you can change the auto-delete setting.

In order to easily recreate an instance later, you can save the full configuration of an instance, including all disks, as a machine image. You can save the content of a single disk by using snapshots. Storage charges apply to both machine images and snapshots.

If you need to restore both memory and disk to a known state each time, consider deleting and recreating instances. A common use case includes large fleets of instances that are replicated from a single template, such as classroom lab environments.
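
The choice between suspending, stopping, and deleting can be sketched as follows. This example assumes that `client` is an instance of `InstancesClient` from the `google-cloud-compute` Python library, whose `suspend`, `stop`, and `delete` methods accept `project`, `zone`, and `instance` keyword arguments. The helper names `choose_mode` and `turn_off` are hypothetical, and the client is passed in as a parameter so that the sketch has no dependency beyond the standard library.

```python
from typing import Any

VALID_MODES = ("suspend", "stop", "delete")


def choose_mode(keep_memory: bool, keep_disk: bool = True) -> str:
    """Pick the cheapest state that still retains what the user needs."""
    if keep_memory:
        return "suspend"  # memory and disks are retained and billed
    if keep_disk:
        return "stop"     # only disks are retained and billed
    return "delete"       # nothing is retained


def turn_off(client: Any, project: str, zone: str, instance: str,
             mode: str = "stop") -> Any:
    """Call the matching method on the (assumed) InstancesClient.

    Returns whatever the client returns, which for the real client is a
    long-running operation that the caller can wait on.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}")
    method = getattr(client, mode)
    return method(project=project, zone=zone, instance=instance)
```

For example, a developer workstation would typically use `choose_mode(keep_memory=True)`, while a classroom lab VM that is recreated from a template each day would use `choose_mode(keep_memory=False, keep_disk=False)`.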

Automate batch jobs

Batch jobs are automated tasks that run either on demand or on a set schedule. After the job is finished, its resources are no longer needed and can be deleted.

Running jobs on demand

Some batch jobs are run on demand. For example, a user might request a time-consuming report that is prepared in the background and emailed to them.

A lightweight solution for running jobs on demand is to use Cloud Functions. Cloud Functions lets developers create single-purpose, standalone functions that respond to events without the need to manage an instance. Cloud Functions pricing is based on usage, so you pay only when a function runs. The following diagram shows how Cloud Functions works with Compute Engine:

Figure: Cloud Functions triggers Compute Engine jobs on demand.

In the preceding diagram, an application invokes a Cloud Function by calling an HTTP trigger URL. An application can call a trigger URL based on user input—for example, if a user requests a report in a frontend web application, that application can call the trigger URL to start preparing the report.

Cloud Functions can also be triggered by events, such as uploading a file to a Cloud Storage bucket. For an example workflow, see Automating the classification of data uploaded to Cloud Storage.

A Cloud Function can call the Compute Engine API to create a temporary instance, and the instance can then execute shell scripts or run programs. To make the instance delete itself after it completes its work, you can use the gcloud command-line tool. For an example, see Create a self-deleting virtual machine on Compute Engine.
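
As a rough sketch of this pattern, the following function builds a request body that a Cloud Function could send to the Compute Engine API `instances.insert` method. It does not call the API itself. The startup script runs a job and then deletes the instance by using gcloud. The function name, the job command, the machine type, and the image family are assumptions for illustration, and the script assumes that gcloud is available on the image and that the instance's service account is allowed to delete instances.

```python
# Startup script template: run the job, then delete this VM.
# The instance name and zone are read from the metadata server.
SELF_DELETE_SCRIPT = """#!/bin/bash
{job_command}
MD=http://metadata.google.internal/computeMetadata/v1/instance
NAME=$(curl -s -H 'Metadata-Flavor: Google' "$MD/name")
ZONE=$(basename "$(curl -s -H 'Metadata-Flavor: Google' "$MD/zone")")
gcloud --quiet compute instances delete "$NAME" --zone="$ZONE"
"""


def build_instance_body(
    name: str,
    zone: str,
    job_command: str,  # hypothetical shell command that performs the job
    machine_type: str = "e2-small",
    source_image: str = "projects/debian-cloud/global/images/family/debian-12",
) -> dict:
    """Return a request body for a temporary, self-deleting instance."""
    script = SELF_DELETE_SCRIPT.format(job_command=job_command)
    return {
        "name": name,
        "machineType": f"zones/{zone}/machineTypes/{machine_type}",
        "disks": [{
            "boot": True,
            "autoDelete": True,  # remove the boot disk with the instance
            "initializeParams": {"sourceImage": source_image},
        }],
        "networkInterfaces": [{
            "network": "global/networks/default",
            # External IP so that the script can reach the API.
            "accessConfigs": [{"type": "ONE_TO_ONE_NAT", "name": "External NAT"}],
        }],
        "serviceAccounts": [{
            "email": "default",
            "scopes": ["https://www.googleapis.com/auth/cloud-platform"],
        }],
        "metadata": {"items": [{"key": "startup-script", "value": script}]},
    }
```

Keeping the body construction separate from the API call makes it easy to inspect or unit test the configuration before any instance is created.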

For more detailed examples, see the Cloud Functions tutorials or Using Cloud Scheduler and Cloud Functions to deploy a periodic Compute Engine VM worker.

Scheduling jobs

In an on-premises environment, cron jobs are the standard way to run scheduled tasks. However, running a cron scheduler requires both that the instance is continually available (which might not be the case), and that the instance is continually running, even when not performing work. Using cron as-is in the cloud can be unreliable and inefficient. Instead, we recommend that you create a distributed cron system, using Cloud Scheduler with Pub/Sub to send messages to Compute Engine instances. For an example of this pattern, see Reliable task scheduling on Compute Engine with Cloud Scheduler.
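
On the instance side, the receiving logic can be as small as a function that maps an incoming message to a task. The following is a minimal sketch, not the referenced solution: the JSON payload format, the task names, and the commands are all hypothetical, and the code that subscribes to Pub/Sub and passes the message data to `handle_message` is omitted.

```python
import json
import subprocess
from typing import Callable

# Hypothetical mapping from task names to the commands they run.
TASKS = {
    "rotate-logs": ["echo", "rotating logs"],
    "build-report": ["echo", "building report"],
}


def handle_message(data: bytes, run: Callable = subprocess.run) -> str:
    """Run the task named in a scheduler message.

    `data` is the raw message payload, assumed here to be UTF-8 JSON
    such as {"task": "rotate-logs"}. Unknown tasks are ignored so that
    a malformed message cannot run arbitrary commands.
    """
    try:
        payload = json.loads(data.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return "ignored"
    command = TASKS.get(payload.get("task")) if isinstance(payload, dict) else None
    if command is None:
        return "ignored"
    run(command, check=True)
    return "done"
```

Because the schedule lives in Cloud Scheduler rather than in a crontab on the instance, any instance that subscribes to the topic can pick up the work.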

Processing large volumes of data

Some batch jobs process volumes of data that are too large to handle with a single VM in a reasonable amount of time. Dataflow offers an Apache Beam-compatible framework for running both batch and stream processing jobs across a fleet of VMs, automatically creating and deleting the resources needed. To get an extra discount from regular pricing, you can use Flexible Resource Scheduling (FlexRS) for batch jobs that can tolerate a delayed start. With FlexRS, Dataflow queues the job and starts it within a six-hour window.
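
For a pipeline written with the Apache Beam Python SDK, Flexible Resource Scheduling is requested through a pipeline option. The following sketch only assembles the command-line options that you would pass to such a pipeline. The helper name and the project, region, and bucket values are placeholders, and the sketch assumes the `--flexrs_goal` option of the Python SDK.

```python
def dataflow_batch_args(project: str, region: str, temp_location: str,
                        use_flexrs: bool = True) -> list:
    """Build pipeline options for a Dataflow batch job."""
    args = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
    ]
    if use_flexrs:
        # Ask Dataflow to trade a delayed start for a lower price.
        args.append("--flexrs_goal=COST_OPTIMIZED")
    return args


if __name__ == "__main__":
    # Placeholder values; replace them with your own.
    print(" ".join(dataflow_batch_args(
        "my-project", "us-central1", "gs://my-bucket/tmp")))
```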

Orchestrating complex workflows

Some jobs require additional resources along with VM instances, such as databases, temporary storage, or extract, transform, load (ETL) pipelines.

You can use Workflows to create lightweight workflows to process events or to orchestrate microservices. Workflows works well with Cloud Functions, Cloud Run, or any other APIs (within or outside Google Cloud) that are network-reachable and support HTTP. Workflows is serverless and can scale to zero, making it suitable for unpredictable workloads that must respond to sudden increases in demand with low latency. For examples, see Loading data from Cloud Storage to BigQuery using Workflows and Create a custom machine learning pipeline with Workflows and serverless services.

For batch orchestration workflows like data engineering or ETL, Google Cloud offers a managed version of Apache Airflow called Cloud Composer. Cloud Composer workflows are modeled as directed acyclic graphs running on a scalable, always-on cluster, and can use a wide range of Airflow operators across other products or clouds. For an example solution that takes snapshots of instances for backup purposes, see Automating infrastructure with Cloud Composer.
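
To illustrate what modeling a workflow as a directed acyclic graph means, the following sketch uses Python's standard `graphlib` module instead of Airflow. The task names are hypothetical. Each task lists the tasks that must finish before it can start, and a topological sort yields a valid execution order in which temporary resources are created before the job and deleted after it.

```python
from graphlib import TopologicalSorter

# Hypothetical ETL workflow: task -> tasks that must complete first.
PIPELINE = {
    "create_temp_storage": set(),
    "start_worker_vm": set(),
    "extract": {"create_temp_storage", "start_worker_vm"},
    "transform": {"extract"},
    "load": {"transform"},
    "delete_worker_vm": {"load"},
    "delete_temp_storage": {"load"},
}


def execution_order(dag: dict) -> list:
    """Return one valid order in which the tasks can run.

    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    for step, task in enumerate(execution_order(PIPELINE), start=1):
        print(step, task)
```

An orchestrator such as Airflow adds scheduling, retries, and operators for specific products on top of this basic ordering guarantee.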

Reduce unused capacity

Building infrastructure for future use requires you to estimate demand. Because estimates are imprecise, VMs are sometimes oversized for their job or entirely unused. These issues are prevalent when migrating on-premises infrastructure with minimal changes to the architecture (sometimes called a lift and shift migration). Although physical hardware investments cannot be easily rolled back, in the cloud you can easily reduce cost by resizing or deleting your instances.

As part of Active Assist, Google provides a Recommendations Hub that displays both security and efficiency recommendations for your project. Two key recommendation types are reducing VM cost and identifying unused resources.

Reducing VM cost

Compute Engine instances are charged per vCPU and memory size, so it's inefficient to run a larger instance than necessary for the actual workload.

Based on VM utilization during the preceding eight days, Google Cloud provides machine sizing recommendations both in the Recommendations Hub and in the Cloud Console Compute Engine page. These recommendations are also available through the gcloud recommender recommendations command and the Recommender API.

To apply a sizing recommendation, you can change the machine type of the instance to a smaller or cheaper machine type. To change an instance's machine type, you must first stop the instance.

Identifying idle instances

Sometimes instances are created for tasks without a clearly defined endpoint. For example, users might create test instances to test a specific feature or version, and then forget to shut down the instances when they're finished. The following diagram shows how you can use Cloud Scheduler with Cloud Functions and the Recommender API to identify and delete idle VMs:

Figure: Recommender recommendations help to identify and delete idle instances.

To automatically identify idle instances, Recommender evaluates the CPU and network usage of Compute Engine instances over a period of time. This data is used to create idle VM recommendations that are available from the Recommender API. In the preceding diagram, Cloud Scheduler triggers a function that gets the idle VM recommendations and then deletes the idle instances. To help reduce costs, you can automate this process so that idle instances are first labeled for review and then, optionally, stopped or deleted.
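
The label-then-act approach can be sketched as a planning function that a scheduled Cloud Function might run. This is a simplified illustration: the `IdleVm` type, the label names, and the action names are hypothetical, and the code that fetches recommendations from the Recommender API and applies the planned actions is omitted.

```python
from dataclasses import dataclass, field

REVIEW_LABEL = "idle-review"  # hypothetical label set on the first pass
KEEP_LABEL = "keep"           # hypothetical opt-out label set by owners


@dataclass
class IdleVm:
    """An instance that an idle VM recommendation points to."""
    name: str
    labels: dict = field(default_factory=dict)


def plan_idle_actions(idle_vms: list, final_action: str = "stop") -> list:
    """Return (action, instance name) pairs for one scheduled run.

    An instance is only labeled the first time it is reported as idle.
    If it is still idle on a later run, and nobody has opted it out,
    the final action ("stop" or "delete") is planned.
    """
    if final_action not in ("stop", "delete"):
        raise ValueError("final_action must be 'stop' or 'delete'")
    plan = []
    for vm in idle_vms:
        if vm.labels.get(KEEP_LABEL) == "true":
            continue  # the owner asked to keep this instance
        if REVIEW_LABEL in vm.labels:
            plan.append((final_action, vm.name))
        else:
            plan.append(("label", vm.name))
    return plan
```

Separating the first report from the final action gives instance owners one scheduling interval in which to respond before anything is turned off.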

For information about how to check associated resources like IP addresses and persistent disks, see Automating cost optimizations with Cloud Functions, Cloud Scheduler, and Cloud Monitoring.

Cleaning up expired instances

You can use Cloud Scheduler and Cloud Functions to set up a garbage collection function, as shown in the following diagram. The function automatically deletes selected VMs that have been running for longer than the time period that you configure.

Figure: Cloud Scheduler helps to clean up expired instances.

In the preceding diagram, Cloud Scheduler sends a message to Pub/Sub, which triggers Cloud Functions to delete expired VMs. For an example of this pattern, see Cleaning up Compute Engine instances at scale.
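
The selection step of such a function might look like the following sketch. It assumes that each instance is represented as a dictionary with the `name`, `labels`, and `creationTimestamp` fields of the Compute Engine instance resource. The opt-in label `ttl-cleanup` and the function name are hypothetical, and the code that lists and deletes instances is omitted.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

CLEANUP_LABEL = "ttl-cleanup"  # hypothetical label that opts a VM in


def find_expired(instances: list, max_age: timedelta,
                 now: Optional[datetime] = None) -> list:
    """Return the names of opted-in instances older than max_age."""
    now = now or datetime.now(timezone.utc)
    expired = []
    for inst in instances:
        # Only consider VMs that were explicitly selected for cleanup.
        if inst.get("labels", {}).get(CLEANUP_LABEL) != "true":
            continue
        # Timestamps are assumed to look like 2024-03-01T08:00:00.000-08:00.
        created = datetime.fromisoformat(inst["creationTimestamp"])
        if now - created > max_age:
            expired.append(inst["name"])
    return expired
```

This sketch measures age from the creation time. If instances are stopped and restarted, you might prefer to measure from the last start time instead.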

What's next