Optimize cost: Compute, containers, and serverless

This document in the Google Cloud Architecture Framework provides recommendations to help you optimize the cost of your virtual machines (VMs), containers, and serverless resources in Google Cloud.

The guidance in this section is intended for architects, developers, and administrators who are responsible for provisioning and managing compute resources for workloads in the cloud.

Compute resources are a critical part of your cloud infrastructure. When you migrate your workloads to Google Cloud, a typical first choice is Compute Engine, which lets you provision and manage VMs efficiently in the cloud. Compute Engine offers a wide range of machine types and is available in all Google Cloud regions. Its predefined and custom machine types let you provision VMs with compute capacity similar to your on-premises infrastructure, which helps you accelerate the migration process. With Compute Engine, you pay only for the infrastructure that you use, and you get significant savings as your usage grows through sustained-use discounts.

In addition to Compute Engine, Google Cloud offers containers and serverless compute services. The serverless approach can be more cost-efficient for new services that aren't always running (for example, APIs, data processing, and event processing).

Along with general recommendations, this document provides guidance to help you optimize the cost of your compute resources when using the following products:

  • Compute Engine
  • Google Kubernetes Engine (GKE)
  • Cloud Run
  • Cloud Functions
  • App Engine

General recommendations

The following recommendations are applicable to all the compute, containers, and serverless services in Google Cloud that are discussed in this section.

Track usage and cost

Use tools such as Cloud Billing reports, budgets and budget alerts, and the export of billing data to BigQuery to monitor resource usage and cost.

Control resource provisioning

Use the following recommendations to control the quantity of resources provisioned in the cloud and the location where the resources are created:

  • To help ensure that resource consumption and cost don't exceed the forecast, use resource quotas.
  • Provision resources in the lowest-cost region that meets the latency requirements of your workload. To control where resources are provisioned, you can use the organization policy constraint gcp.resourceLocations.
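As a sketch of the location constraint above, the following organization policy (v2 YAML format) restricts new resources to US locations; `ORG_ID` is a placeholder for your organization ID, and the value group shown is illustrative:

```yaml
# Restrict resource creation to US locations (illustrative).
name: organizations/ORG_ID/policies/gcp.resourceLocations
spec:
  rules:
  - values:
      allowedValues:
      - in:us-locations
```

You can apply a policy file like this with `gcloud org-policies set-policy policy.yaml`.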

Get discounts for committed use

Committed use discounts (CUDs) are ideal for workloads with predictable resource needs. After migrating your workload to Google Cloud, find the baseline for the resources required, and get deeper discounts for committed usage. For example, purchase a one-year or three-year commitment, and get a substantial discount on Compute Engine VM pricing.

Automate cost-tracking using labels

Define and assign labels consistently. The following are examples of how you can use labels to automate cost-tracking:

  • For VMs that only developers use during business hours, assign the label env: development. You can then use Cloud Scheduler to trigger a Cloud Function that shuts down these VMs after business hours and restarts them when necessary.

  • For an application that has several Cloud Run services and Cloud Functions instances, assign a consistent label to all the Cloud Run and Cloud Functions resources. Identify the high-cost areas, and take action to reduce cost.
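The label-based shutdown described above can be sketched as a small selection function. This is a minimal, hypothetical example: in a real Cloud Function, the instance inventory would come from the Compute Engine API (for example, through the google-cloud-compute client library) rather than a hardcoded list.

```python
# Sketch: decide which VMs a scheduled Cloud Function should stop after
# business hours, based on the env label. The inventory data and names
# below are hypothetical.

def instances_to_stop(instances):
    """Return names of running instances labeled env: development."""
    return [
        inst["name"]
        for inst in instances
        if inst.get("labels", {}).get("env") == "development"
        and inst.get("status") == "RUNNING"
    ]

# Example inventory, shaped like Compute Engine API results.
inventory = [
    {"name": "dev-vm-1", "labels": {"env": "development"}, "status": "RUNNING"},
    {"name": "prod-vm-1", "labels": {"env": "production"}, "status": "RUNNING"},
    {"name": "dev-vm-2", "labels": {"env": "development"}, "status": "TERMINATED"},
]

print(instances_to_stop(inventory))  # ['dev-vm-1']
```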

Customize billing reports

Configure your Cloud Billing reports by setting up the required filters and grouping the data as necessary (for example, by projects, services, or labels).

Promote a cost-saving culture

Train your developers and operators on your cloud infrastructure. Create and promote learning programs using traditional or online classes, discussion groups, peer reviews, pair programming, and cost-saving games. As shown in Google's DORA research, organizational culture is a key driver for improving performance, reducing rework and burnout, and optimizing cost. By giving employees visibility into the cost of their resources, you help them align their priorities and activities with business objectives and constraints.

Compute Engine

This section provides guidance to help you optimize the cost of your Compute Engine resources. In addition to this guidance, we recommend that you follow the general recommendations discussed earlier.

Understand the billing model

To learn about the billing options for Compute Engine, see Pricing.

Analyze resource consumption

To help you understand resource consumption in Compute Engine, export usage data to BigQuery. Query the exported data in BigQuery to analyze your project's virtual CPU (vCPU) usage trends, and determine the number of vCPUs that you can reclaim. If you've defined thresholds for the number of cores per project, analyze usage trends to spot anomalies and take corrective actions.
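The threshold check described above can be sketched as follows. The rows are hypothetical stand-ins for results you might query from the BigQuery usage export; the schema and values are illustrative only.

```python
# Sketch: flag days where a project's vCPU usage exceeds a defined
# per-project threshold. Rows mimic results from a BigQuery usage
# export query (hypothetical data).

def vcpu_anomalies(daily_usage, threshold):
    """Return (date, vcpus) pairs that exceed the threshold."""
    return [(day, vcpus) for day, vcpus in daily_usage if vcpus > threshold]

usage = [
    ("2023-06-01", 48),
    ("2023-06-02", 52),
    ("2023-06-03", 130),  # spike worth investigating
    ("2023-06-04", 50),
]

print(vcpu_anomalies(usage, threshold=64))  # [('2023-06-03', 130)]
```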

Reclaim idle resources

Use the following recommendations to identify and reclaim unused VMs and disks, such as VMs for proof-of-concept projects that have since been deprioritized:

  • Use the idle VM recommender to identify inactive VMs and persistent disks based on usage metrics.
  • Before deleting resources, assess the potential impact of the action and plan to recreate the resources if that becomes necessary.
  • Before deleting a VM, consider taking a snapshot. When you delete a VM, the attached disks are deleted, unless you've selected the Keep disk option.
  • When feasible, consider stopping VMs instead of deleting them. When you stop a VM, the instance is terminated, but disks and IP addresses are retained until you detach or delete them.

Adjust capacity to match demand

Schedule your VMs to start and stop automatically. For example, if a VM is used only eight hours a day for five days a week (that's 40 hours in the week), you can potentially reduce costs by about 75 percent by stopping the VM during the 128 hours in the week when the VM is not used. To learn more, see Cost optimization using automated VM management.
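The savings estimate above works out as follows:

```python
# Worked example of the scheduling savings estimate: a VM used 8 hours
# a day, 5 days a week, runs only 40 of the 168 hours in a week.
hours_in_week = 24 * 7                    # 168
hours_used = 8 * 5                        # 40
hours_idle = hours_in_week - hours_used   # 128

savings = hours_idle / hours_in_week
print(f"{savings:.0%}")  # 76%, roughly the 75 percent cited above
```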

Autoscale compute capacity based on demand by using managed instance groups. You can autoscale capacity based on the parameters that matter to your business (for example, CPU usage or load-balancing capacity).
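For example, a CPU-based autoscaling policy for a managed instance group might be configured as follows; the group name, zone, and values are placeholders:

```shell
# Illustrative: autoscale a managed instance group on CPU utilization.
gcloud compute instance-groups managed set-autoscaling web-mig \
    --zone=us-central1-a \
    --min-num-replicas=2 \
    --max-num-replicas=10 \
    --target-cpu-utilization=0.6 \
    --cool-down-period=90
```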

Choose appropriate machine types

Size your VMs to match your workload's compute requirements by using the VM machine type recommender.

For workloads with predictable resource requirements, tailor the machine type to your needs and save money by using custom VMs.

For batch-processing workloads that are fault-tolerant, consider using preemptible VM instances. High-performance computing (HPC), big data, media transcoding, continuous integration and continuous delivery (CI/CD) pipelines, and stateless web applications are examples of workloads that can be deployed on preemptible VMs. For an example of how Descartes Labs reduced their analysis costs by using preemptible VMs to process satellite imagery, see the Descartes Labs case study.

Evaluate licensing options

When you migrate third-party workloads to Google Cloud, you might be able to reduce cost by bringing your own licenses (BYOL). For example, to deploy Microsoft Windows Server VMs, instead of using a premium image that incurs additional cost for the third-party license, you can create and use a custom Windows BYOL image. You then pay only for the VM infrastructure that you use on Google Cloud. This strategy helps you continue to realize value from your existing investments in third-party licenses.

If you decide to use a BYOL approach, we recommend that you do the following:

  • Provision the required number of compute CPU cores independently of memory by using custom machine types, and limit the third-party licensing cost to the number of CPU cores that you need.
  • Reduce the number of vCPUs per core from 2 to 1 by disabling simultaneous multithreading (SMT), and reduce your licensing costs by 50 percent.
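Both recommendations can be combined when you create a VM. The following is an illustrative sketch; the VM, image, and project names are placeholders:

```shell
# Illustrative: a custom machine type sized for the license, with SMT
# disabled (1 thread per core) to halve per-vCPU licensing costs.
gcloud compute instances create byol-win-vm \
    --zone=us-central1-a \
    --custom-cpu=4 \
    --custom-memory=16GB \
    --threads-per-core=1 \
    --image=my-windows-byol-image \
    --image-project=my-project
```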

If your third-party workloads need dedicated hardware to meet security or compliance requirements, you can bring your own licenses to sole-tenant nodes.

Google Kubernetes Engine

This section provides guidance to help you optimize the cost of your GKE resources.

In addition to the following recommendations, see the general recommendations discussed earlier:

  • Use Cloud Monitoring to get real-time information about your GKE clusters (spending, bin-packing, application right-sizing, and scaling).
  • Use GKE Autopilot to let GKE maximize the efficiency of your cluster's infrastructure. You don't need to monitor the health of your nodes, handle bin-packing, or calculate the capacity that your workloads need.
  • Fine-tune GKE autoscaling by using Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), Cluster Autoscaler (CA), or node auto-provisioning based on your workload's requirements.
  • For batch workloads that aren't sensitive to startup latency, use the optimize-utilization autoscaling profile to help improve the utilization of the cluster.
  • Use node auto-provisioning to extend the GKE cluster autoscaler, and efficiently create and delete node pools based on the specifications of pending pods without over-provisioning.
  • Use separate node pools: a static node pool for static load, and dynamic node pools with cluster autoscaling groups for dynamic loads.
  • Use preemptible VMs for Kubernetes node pools when your pods are fault-tolerant and can terminate gracefully in less than 25 seconds. Combined with the GKE cluster autoscaler, this strategy helps you ensure that the node pool with lower-cost VMs (in this case, the preemptible node pool) scales first.
  • Choose cost-efficient machine types (for example, E2, N2D, or T2D), which provide 20–40% better price-performance.
  • Use GKE usage metering to analyze your clusters' usage profiles by namespaces and labels. Identify the team or application that's spending the most, the environment or component that caused spikes in usage or cost, and the team that's wasting resources.
  • Use resource quotas in multi-tenant clusters to prevent any tenant from using more than its assigned share of cluster resources.
  • Schedule automatic downscaling of development and test environments after business hours.
  • Follow the best practices for running cost-optimized Kubernetes applications on GKE.
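As one example of fine-tuning GKE autoscaling, a Horizontal Pod Autoscaler that targets CPU utilization can be declared like this; the Deployment name and values are illustrative:

```yaml
# Illustrative HPA: scale the "web" Deployment on CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
```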

Cloud Run

This section provides guidance to help you optimize the cost of your Cloud Run resources.

In addition to the following recommendations, see the general recommendations discussed earlier:

  • Adjust the concurrency setting (default: 80) to reduce cost. Cloud Run determines the number of requests to be sent to an instance based on CPU and memory usage. By increasing the request concurrency, you can reduce the number of instances required.
  • Set a limit for the number of instances that can be deployed.
  • Estimate the number of instances required by using the Billable Instance Time metric. For example, if the metric shows 100 s/s, around 100 instances were scheduled. Add a 30% buffer to preserve performance; that is, 130 instances for 100 s/s of traffic.
  • To reduce the impact of cold starts, configure a minimum number of instances. When these instances are idle, they are billed at a tenth of the price.
  • Track CPU usage, and adjust the CPU limits accordingly.
  • Use traffic management to determine a cost-optimal configuration.
  • Consider using Cloud CDN or Firebase Hosting for serving static assets.
  • For Cloud Run apps that handle requests globally, consider deploying the app to multiple regions, because cross-continent egress traffic can be expensive. This design is recommended if you use a load balancer and CDN.
  • Reduce the startup times for your instances, because the startup time is also billable.
  • Purchase committed use discounts, and save up to 17% off the on-demand pricing for a one-year commitment.
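Several of the settings above can be applied at deployment time. The following is an illustrative sketch; the service, image path, region, and values are placeholders to adapt to your workload:

```shell
# Illustrative: deploy a Cloud Run service with tuned concurrency and
# instance limits. Raising concurrency can reduce the instance count;
# min instances reduce cold starts; max instances cap cost.
gcloud run deploy my-service \
    --image=us-docker.pkg.dev/my-project/my-repo/my-image \
    --region=us-central1 \
    --concurrency=120 \
    --min-instances=1 \
    --max-instances=50
```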

Cloud Functions

This section provides guidance to help you optimize the cost of your Cloud Functions resources.

In addition to the following recommendations, see the general recommendations discussed earlier:

  • Observe the execution time of your functions. Experiment and benchmark to design the smallest function that still meets your required performance threshold.
  • If your Cloud Functions workloads run constantly, consider using GKE or Compute Engine to handle the workloads. Containers or VMs might be lower-cost options for always-running workloads.
  • Limit the number of function instances that can co-exist.
  • Benchmark the runtime performance of the Cloud Functions programming languages against the workload of your function. Programs in compiled languages have longer cold starts, but run faster. Programs in interpreted languages run slower, but have a lower cold-start overhead. Short, simple functions that run frequently might cost less in an interpreted language.
  • Delete temporary files written to the local disk, which is an in-memory file system. Temporary files consume memory that's allocated to your function, and sometimes persist between invocations. If you don't delete these files, an out-of-memory error might occur and trigger a cold start, which increases the execution time and cost.
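The temp-file hygiene in the last recommendation can be sketched as follows. In Cloud Functions, temporary files live in an in-memory file system, so deleting them in a `finally` block ensures the memory is released even if processing fails; the processing step here is a hypothetical stand-in.

```python
# Sketch: write a temp file, process it, and always clean up so the
# in-memory file system doesn't consume the function's memory.
import os
import tempfile

def process_upload(data: bytes) -> int:
    """Write data to a temp file, process it, and always clean up."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return os.path.getsize(path)  # stand-in for real processing
    finally:
        os.remove(path)  # free the space between invocations

print(process_upload(b"hello"))  # 5
```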

App Engine

This section provides guidance to help you optimize the cost of your App Engine resources.

In addition to the following recommendations, see the general recommendations discussed earlier:

  • Set maximum instances based on your traffic and request latency. App Engine usually scales capacity based on the traffic that the applications receive. You can control cost by limiting the number of instances that App Engine can create.
  • To limit the memory or CPU available for your application, set an instance class. For CPU-intensive applications, allocate more CPU. Test a few configurations to determine the optimal size.
  • Benchmark your App Engine workload in multiple programming languages. For example, a workload implemented in one language might need fewer instances, and cost less, to complete tasks on time than the same workload programmed in another language.
  • Optimize for fewer cold starts. When possible, reduce CPU-intensive or long-running tasks that occur in the global scope. Try to break down the task into smaller operations that can be "lazy loaded" into the context of a request.
  • If you expect bursty traffic, configure a minimum number of idle instances that are pre-warmed. If you are not expecting traffic, you can configure the minimum idle instances to zero.
  • To balance performance and cost, run an A/B test by splitting traffic between two versions, each with a different configuration. Monitor the performance and cost of each version, tune as necessary, and decide the configuration to which traffic should be sent.
  • Configure request concurrency, and set the maximum concurrent requests higher than the default. The more requests each instance can handle concurrently, the more efficiently you can use existing instances to serve traffic.
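Several of the settings above map to fields in the standard environment's `app.yaml`. The following fragment is illustrative; the runtime and values are placeholders to tune for your workload:

```yaml
# Illustrative app.yaml: instance class, instance limits, pre-warmed
# idle instances, and request concurrency above the default.
runtime: python312
instance_class: F2

automatic_scaling:
  max_instances: 20
  min_idle_instances: 1
  max_concurrent_requests: 30
```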

What's next