Performance and cost optimization

This section of the architecture framework discusses how to balance performance and cost optimizations in your deployments.

The framework consists of the following series of articles:

  • Evaluate performance requirements. Determine the priority of your various applications and the minimum performance that you require of them.
  • Use scalable design patterns. Improve scalability and performance with autoscaling, compute choices, and storage configurations.
  • Identify and implement cost-saving approaches. Evaluate the cost of each running service and assign priorities so that you can balance service availability against cost.

Best practices

  • Use autoscaling and data processing.
  • Use GPUs and TPUs to increase performance.
  • Identify apps to tune.
  • Analyze your costs and optimize.

Use autoscaling and data processing

Use autoscaling so that as load increases or decreases, the services add or release resources to match.

Compute Engine autoscaling

Managed instance groups (MIGs) let you scale your stateless apps across multiple identical VMs that Compute Engine launches from an instance template. You can configure an autoscaling policy to scale the group based on CPU utilization, load-balancing capacity, Cloud Monitoring metrics, schedules, or, for zonal MIGs, a queue-based workload such as Pub/Sub.
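As a rough illustration of scaling on a CPU utilization target, the following Python sketch estimates the group size that would bring average utilization back to the target. It is a simplified model, not the autoscaler's actual algorithm: it assumes that total load stays constant and spreads evenly across instances. The function name, parameters, and size limits are hypothetical.

```python
import math

def recommended_size(current_size, observed_utilization, target_utilization,
                     min_size=1, max_size=10):
    """Estimate the instance count that brings average utilization to the target.

    Simplified model (an assumption, not the real autoscaler): total load is
    current_size * observed_utilization and is spread evenly across instances.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    raw = current_size * observed_utilization / target_utilization
    # Round first so that floating-point noise does not add a spurious instance.
    needed = math.ceil(round(raw, 6))
    # Clamp to the hypothetical minimum and maximum group sizes.
    return max(min_size, min(max_size, needed))

print(recommended_size(4, 0.75, 0.5))  # 4 * 0.75 / 0.5 = 6 instances
```

The same formula scales the group in when observed utilization drops below the target, down to the configured minimum.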

Google Kubernetes Engine autoscaling

You can use the cluster autoscaler feature in Google Kubernetes Engine (GKE) to manage your cluster's node pool based on the varying demands of your workloads. Cluster autoscaler increases or decreases the size of the node pool automatically, based on the resource requests (rather than the actual resource utilization) of the Pods running on that node pool's nodes.
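To show why resource requests, and not observed utilization, drive the node count, the following Python sketch packs Pod CPU requests onto identical nodes with a first-fit-decreasing heuristic. This is only an approximation for illustration: the real cluster autoscaler relies on scheduler simulation and considers more than CPU. The function name and the millicore figures are hypothetical.

```python
def nodes_needed_for_requests(pod_cpu_requests_m, node_allocatable_cpu_m):
    """Count the identical nodes needed to fit the Pods' CPU requests (millicores).

    Uses first-fit-decreasing bin packing. Actual CPU usage of the Pods is
    deliberately not an input: only the declared requests matter.
    """
    free = []  # remaining allocatable millicores on each node
    for request in sorted(pod_cpu_requests_m, reverse=True):
        if request > node_allocatable_cpu_m:
            raise ValueError("a Pod requests more CPU than one node can offer")
        for i, remaining in enumerate(free):
            if request <= remaining:
                free[i] -= request
                break
        else:
            # No existing node has room, so a new node is added.
            free.append(node_allocatable_cpu_m - request)
    return len(free)

pods = [1500] + [500] * 6  # 4500 millicores requested in total
print(nodes_needed_for_requests(pods, 2000))  # 3
```

Because only requests are counted, Pods that request far more CPU than they use keep nodes allocated, which is one reason to review requests when you tune cost.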

Serverless autoscaling

Serverless compute options include Cloud Run, App Engine, and Cloud Functions, each of which provides autoscaling capabilities. Use these serverless options to scale your microservices or functions.

Data processing

Dataproc and Dataflow offer autoscaling options to scale your data pipelines and data processing. Use these options to allow your pipelines to access more computing resources based on the processing load.

Design questions

  • Which of your applications have variable user load or processing requirements?
  • Which of your data processing pipelines have variable data requirements?


Recommendations

  • Use Google Cloud Load Balancers to provide a global endpoint.
  • Use managed instance groups with Compute Engine to automatically scale.
  • Use the cluster autoscaler in GKE to automatically scale the cluster.
  • Use App Engine to autoscale your Platform-as-a-Service (PaaS) application.
  • Use Cloud Run or Cloud Functions to autoscale your function or microservice.


Use GPUs and TPUs to increase performance

Google Cloud provides options to accelerate the performance of your workloads. You can use these specialized hardware platforms to increase your application and data processing performance.

Graphics Processing Unit (GPU)

Compute Engine provides GPUs that you can add to your virtual machine instances. You can use these GPUs to accelerate specific workloads on your instances such as machine learning and data processing.

Tensor Processing Unit (TPU)

A TPU is a matrix processor that Google designed specifically for machine learning workloads. TPUs are best suited for massive matrix operations that run through a large pipeline and require significantly less memory access.


Identify apps to tune

Application Performance Management (APM) includes tools to help you reduce latency and cost, so that you can run more efficient applications. With Cloud Trace, Cloud Debugger, and Cloud Profiler, you gain insight into how your code and services function, and you can troubleshoot if needed.


Latency plays a big role in determining your users' experience. When your application backend becomes complex or you adopt a microservice architecture, it's challenging to measure the latency of inter-service communication or to identify bottlenecks. Cloud Trace and OpenTelemetry tools help you collect latency data from your deployments at scale and analyze it quickly.
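The kind of data that tracing tools collect can be pictured with a hand-rolled timer. The following Python sketch records a named duration for each step of a request and picks out the slowest step. It is a stand-in that uses only the standard library, not the Cloud Trace or OpenTelemetry APIs, and the step names and sleep durations are hypothetical.

```python
import time
from contextlib import contextmanager

spans = []  # collected (name, duration_in_seconds) pairs

@contextmanager
def span(name):
    """Time the enclosed block and record it, even if the block raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, time.perf_counter() - start))

def handle_request():
    # Hypothetical request made of two downstream calls.
    with span("handle_request"):
        with span("auth"):
            time.sleep(0.01)
        with span("database"):
            time.sleep(0.03)

handle_request()
# Ignore the enclosing span and find the slowest child step.
children = [s for s in spans if s[0] != "handle_request"]
slowest = max(children, key=lambda s: s[1])
print(f"slowest step: {slowest[0]} ({slowest[1] * 1000:.1f} ms)")
```

A tracing library adds what this sketch lacks: propagation of trace context between services, sampling, and export to a backend for analysis.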


Cloud Debugger helps you inspect and analyze your production code behavior in real time without affecting its performance or slowing it down.


Poorly performing code increases the latency and cost of applications and web services. Cloud Profiler helps you identify and address performance issues by continuously analyzing the performance of CPU-intensive or memory-intensive functions executed across an application.


Recommendations

  • Use Cloud Trace to instrument your applications.
  • Use Cloud Debugger to provide real-time production debugging capabilities.
  • Use Cloud Profiler to analyze the operating performance of your applications.

Analyze your costs and optimize

The first step in optimizing your cost is to understand your current usage and costs. Google Cloud provides an Export Billing to BigQuery feature that lets you analyze your billing data in detail. You can connect BigQuery to Google Data Studio or Looker, or to third-party business intelligence (BI) tools like Tableau or Qlik. Use the programmatic notifications feature to send notifications when your spending exceeds a budget threshold. You can use budget notifications with third-party solution providers as well as customized applications.
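A customized application that reacts to budget notifications usually needs to decide which thresholds the current spend has crossed. The following Python sketch shows that check in isolation. The function name, parameter names, and threshold values are hypothetical and do not reflect the actual notification payload format.

```python
def crossed_thresholds(cost_amount, budget_amount, thresholds=(0.5, 0.9, 1.0)):
    """Return the budget fractions that the current spend has reached or exceeded.

    thresholds are fractions of the budget (hypothetical defaults: 50%, 90%, 100%).
    """
    if budget_amount <= 0:
        raise ValueError("budget_amount must be positive")
    ratio = cost_amount / budget_amount
    return [t for t in sorted(thresholds) if ratio >= t]

print(crossed_thresholds(950, 1000))  # [0.5, 0.9]
```

An application might, for example, post a message to a chat channel at the lower thresholds and open an incident when the list includes 1.0.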

Sustained use discounts are automatic discounts that you receive for running certain Compute Engine virtual machine (VM) types for a significant portion of the billing month.
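The effect of a usage-based discount can be worked through with tiered pricing arithmetic. The following Python sketch computes the billed fraction of a full month's list price for a given fraction of the month in use. The tier widths and price multipliers are illustrative assumptions, not published rates, so check the current pricing documentation for the machine types that you use.

```python
# Illustrative tiers (assumed values): each quarter of the month is billed
# at a lower multiplier of the base rate than the one before it.
TIERS = [(0.25, 1.0), (0.25, 0.8), (0.25, 0.6), (0.25, 0.4)]

def billed_fraction(usage_fraction, tiers=TIERS):
    """Return the cost as a fraction of one full month at list price.

    usage_fraction is the share of the billing month that the VM ran (0 to 1).
    """
    if not 0 <= usage_fraction <= 1:
        raise ValueError("usage_fraction must be between 0 and 1")
    billed = 0.0
    remaining = usage_fraction
    for width, multiplier in tiers:
        used = min(remaining, width)  # usage that falls inside this tier
        billed += used * multiplier
        remaining -= used
        if remaining <= 0:
            break
    return billed

print(round(billed_fraction(1.0), 2))  # 0.7, a 30% discount under these tiers
print(round(billed_fraction(0.5), 2))  # 0.45, versus 0.5 at list price
```

Under these assumed tiers, a VM that runs for only a quarter of the month receives no discount, which is why the discount rewards prolonged usage.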

Committed use discounts are ideal for workloads with predictable resource needs. When you purchase a committed use contract, you purchase a certain amount of vCPUs, memory, GPUs, and local SSDs at a discounted price in return for committing to pay for those resources for 1 year or 3 years.
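Whether a commitment pays off depends on how much of the committed capacity you expect to use. The following Python sketch compares the two options as fractions of the full on-demand price over the term. The 30% discount in the example is a hypothetical figure, and the model assumes that you pay for the commitment whether or not you use it.

```python
def compare_commitment(expected_utilization, discount):
    """Compare on-demand and committed cost, both relative to full on-demand price.

    expected_utilization: share of the term the resources would actually run (0 to 1).
    discount: committed use discount as a fraction (hypothetical input).
    """
    if not 0 <= expected_utilization <= 1:
        raise ValueError("expected_utilization must be between 0 and 1")
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    on_demand = expected_utilization  # pay only for the time in use
    committed = 1 - discount          # pay the discounted rate for the whole term
    return {"on_demand": on_demand, "committed": committed,
            "commit": committed < on_demand}

print(compare_commitment(0.8, 0.3)["commit"])  # True
print(compare_commitment(0.6, 0.3)["commit"])  # False
```

In this model the break-even point is an expected utilization of 1 minus the discount, so a predictable, near-constant workload is the clearest candidate for a commitment.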

A preemptible VM is an instance that you can create and run at a much lower price than a normal instance. However, Compute Engine might terminate (that is, preempt) these instances if it requires access to those resources for other tasks. Preemptible instances use excess Compute Engine capacity, so their availability varies with usage.

When you understand which components make up your cost, you can decide how to optimize. Finding resources with low utilization or that aren't necessary is an excellent place to start. Compute Engine provides you with sizing recommendations for VMs that you can use to help size your resources. After you implement changes, you can compare your subsequent billing export data to view the differences in cost.
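As a starting point for finding low-utilization resources, the following Python sketch flags VMs whose average CPU utilization falls below a cutoff. The VM names, the utilization figures, and the 20% threshold are hypothetical, and the sketch does not reproduce how Compute Engine derives its sizing recommendations.

```python
def flag_underutilized(avg_cpu_by_vm, threshold=0.2):
    """Return the names of VMs below the utilization threshold, lowest first."""
    flagged = [(cpu, name) for name, cpu in avg_cpu_by_vm.items() if cpu < threshold]
    return [name for cpu, name in sorted(flagged)]

# Hypothetical average CPU utilization over an observation window.
usage = {"web-1": 0.55, "batch-1": 0.05, "cache-1": 0.12}
print(flag_underutilized(usage))  # ['batch-1', 'cache-1']
```

CPU alone can mislead for memory-bound services, so treat a flagged VM as a candidate for review and not as an automatic resize.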

Want to forecast your usage cost? Use the Google Cloud Pricing Calculator.


Recommendations

  • Use unique labeling across your organization to track usage.
  • Use the Export Billing to BigQuery feature.
  • Use Data Studio or other visualization tools to visualize billing data reports.
  • Implement the right-sizing recommendations made by Compute Engine.
  • Identify your 24/7 consumption on Compute Engine, along with your predicted usage, to decide which committed use discounts to purchase.
  • Use Cloud Storage Object Lifecycle Management to manage storage cost.
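
Consistent labels pay off when you break down billing data. The following Python sketch totals cost per label value over a handful of in-memory rows. The row layout ("cost" and "labels" keys) and the "team" label are hypothetical simplifications and do not match the exported billing schema, which you would normally query in BigQuery.

```python
from collections import defaultdict

def cost_by_label(rows, label_key):
    """Sum the cost of each row under the value of the given label."""
    totals = defaultdict(float)
    for row in rows:
        # Rows without the label are grouped so that the gap stays visible.
        value = row.get("labels", {}).get(label_key, "(unlabeled)")
        totals[value] += row["cost"]
    return dict(totals)

rows = [
    {"cost": 12.5, "labels": {"team": "checkout"}},
    {"cost": 3.0, "labels": {}},
    {"cost": 7.5, "labels": {"team": "checkout"}},
]
print(cost_by_label(rows, "team"))  # {'checkout': 20.0, '(unlabeled)': 3.0}
```

A large "(unlabeled)" total is itself a useful signal that the labeling policy is not applied everywhere.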