Load Balancing and Scaling

Compute Engine offers load balancing and autoscaling for groups of instances.


Load Balancing

Google Compute Engine offers server-side load balancing so you can distribute incoming network traffic across multiple virtual machine instances. Both network load balancing and HTTP(S) load balancing provide the following benefits:

  • Scale your application
  • Support heavy traffic
  • Detect and automatically remove unhealthy virtual machine instances; instances that become healthy again are automatically re-added
  • Route traffic to the closest virtual machine

In addition, HTTP(S) load balancing also supports the following:

  • Balance loads across regions
  • Direct traffic to specific backends based on URLs

Google Compute Engine load balancing uses forwarding rule resources to match certain types of traffic and forward it to a load balancer. For example, a forwarding rule can match TCP traffic destined for port 80 on a specific IP address and forward it to a load balancer, which then directs it to healthy virtual machine instances.
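As a sketch, the forwarding rule in this example could be created with the gcloud CLI roughly as follows; the resource names and region are illustrative, and the instances that the target pool will contain must already exist:

```shell
# Create a target pool to receive the forwarded traffic, then a
# forwarding rule that matches TCP traffic on port 80 and sends it
# to that pool. Names (www-pool, www-rule) and region are examples.
gcloud compute target-pools create www-pool \
    --region us-central1

gcloud compute forwarding-rules create www-rule \
    --region us-central1 \
    --ports 80 \
    --target-pool www-pool
```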

Compute Engine load balancing is a managed service, which means its components are redundant and highly available. If a load balancing component fails, it is restarted or replaced automatically and immediately.

Google offers two types of load balancing that differ in capabilities, usage scenarios, and how you configure them. HTTP(S) load balancing is appropriate for HTTP and HTTPS traffic and can forward traffic across regions, to specific backends based on URL, or both. Network load balancing is appropriate for other types of traffic, but it is restricted to a single region.

The following scenarios can help you decide whether network load balancing or HTTP(S) load balancing best meets your needs.

HTTP(S) load balancing

Cross-region load balancing

[Figure: representation of cross-region load balancing]

You can use a global IP address that can intelligently route users based on proximity. For example, if you set up instances in North America, Europe, and Asia, users around the world will be automatically sent to the backends closest to them, assuming those instances have enough capacity. If the closest instances do not have enough capacity, cross-region load balancing automatically forwards users to the next closest region.
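At the gcloud level, the global pieces of this setup can be sketched roughly as follows; it assumes a backend service (here called web-backend-service) with instance groups in each region already exists, and all names are illustrative:

```shell
# A URL map routes requests to backend services; a target HTTP proxy
# and a single *global* forwarding rule expose one IP address that
# serves users from the closest region with available capacity.
gcloud compute url-maps create web-map \
    --default-service web-backend-service

gcloud compute target-http-proxies create web-proxy \
    --url-map web-map

gcloud compute forwarding-rules create web-rule \
    --global \
    --target-http-proxy web-proxy \
    --ports 80
```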

Get started with cross-region load balancing

Content-based load balancing

[Figure: representation of content-based load balancing]

Content-based or content-aware load balancing uses HTTP(S) load balancing to distribute traffic to different instances based on the incoming URL. For example, you can set up some instances to handle your video content and another set to handle everything else. You can configure your load balancer to direct traffic for example.com/video to the video servers and example.com/ to the default servers.
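A rough sketch of this split with the gcloud CLI, assuming a URL map (web-map) and two backend services (web-backend-service, video-backend-service) already exist; all of these names are illustrative:

```shell
# Add a path matcher so that /video/* requests go to the video
# backend service while everything else uses the default service.
gcloud compute url-maps add-path-matcher web-map \
    --default-service web-backend-service \
    --path-matcher-name video-matcher \
    --path-rules "/video/*=video-backend-service"
```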

Get started with content-based load balancing

Content-based and cross-region load balancing can work together by using multiple backend services and multiple regions. You can build on the scenarios above to create a load balancing configuration that meets your needs.

Network load balancing

[Figure: representation of network load balancing]

Assume that you are running a non-HTTP(S) service and you are starting to get a high enough level of traffic that you need to add additional instances to help respond to this load. You can add additional Google Compute Engine instances and configure load balancing to spread the load between these instances. In this situation, you would serve the same content from each of the instances. As your site becomes more popular, you would continue increasing the number of instances that are available to respond to requests.

Unlike HTTP(S) load balancing, network load balancing cannot automatically forward traffic to a different region. If you want your service to function in multiple regions, you must set up a separate IP address, load balancer, and set of instances for each region.
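For illustration, the per-region setup might look roughly like the following for one region; each additional region would repeat these steps with its own resources, and all names here are assumptions:

```shell
# One region's worth of network load balancing: a health check,
# a target pool containing that region's instances, and a regional
# forwarding rule with its own IP address.
gcloud compute http-health-checks create basic-check

gcloud compute target-pools create eu-pool \
    --region europe-west1 \
    --http-health-check basic-check

gcloud compute target-pools add-instances eu-pool \
    --instances eu-instance-1,eu-instance-2 \
    --instances-zone europe-west1-b

gcloud compute forwarding-rules create eu-rule \
    --region europe-west1 \
    --ports 80 \
    --target-pool eu-pool
```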

Get started with network load balancing

Autoscaling

Compute Engine offers autoscaling to automatically add or remove virtual machines from an instance group based on increases or decreases in load. This allows your applications to gracefully handle increases in traffic and reduces cost when the need for resources is lower. You just define the autoscaling policy and the autoscaler performs automatic scaling based on the measured load.

Autoscaling policies

Choose from a variety of policies that an autoscaler can use to scale your virtual machines. When you create an autoscaler, you must always specify a single policy for it; you cannot define more than one policy per autoscaler.

The following sections discuss each of these autoscaling policies in general; for more information about how to set up the specific autoscaling policy, see the respective policy documentation.

CPU utilization

CPU utilization is the most basic form of autoscaling that you can perform. This policy tells the autoscaler to watch the average CPU utilization of a group of virtual machines and to add or remove virtual machines from the group to maintain your desired utilization. This is useful for configurations that are CPU-intensive but fluctuate in CPU usage.
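As a sketch, enabling this policy on a managed instance group with the gcloud CLI might look like the following; the group name, zone, and 0.6 (60%) target are illustrative:

```shell
# Autoscale example-group up to 10 instances, adding or removing
# VMs to keep average CPU utilization near 60%; wait 90 seconds
# after a VM starts before using its metrics.
gcloud compute instance-groups managed set-autoscaling example-group \
    --zone us-central1-f \
    --max-num-replicas 10 \
    --target-cpu-utilization 0.6 \
    --cool-down-period 90
```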

For more information, see Scaling Based on CPU utilization.

HTTP(S) load balancing serving capacity

If you set up an autoscaler to scale based on HTTP(S) load balancing serving capacity, the autoscaler watches the serving capacity of an instance group and scales if the virtual machine instances are over or under capacity.

The serving capacity of an instance can be defined in the load balancer's backend service and can be based on either utilization or requests per second.
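A minimal sketch with the gcloud CLI, assuming the instance group is already configured as a backend of an HTTP(S) load balancer; the 0.8 target (80% of the serving capacity defined in the backend service) and the names are illustrative:

```shell
# Scale the group to keep each instance at about 80% of the serving
# capacity (utilization or requests per second) defined in the
# load balancer's backend service.
gcloud compute instance-groups managed set-autoscaling example-group \
    --zone us-central1-f \
    --max-num-replicas 10 \
    --target-load-balancing-utilization 0.8
```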

For more information, see Scaling Based on HTTP(S) load balancing.

Stackdriver Monitoring metrics

If you export or use Stackdriver Monitoring metrics, you can set up an autoscaler to collect data for a specific metric and scale based on your desired utilization level. You can scale based on standard metrics provided by Stackdriver Monitoring or on custom metrics that you create.
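As an illustrative sketch, scaling on a custom metric might look like the following; the metric name and target value are assumptions, and the metric must provide a per-instance value that the autoscaler can read:

```shell
# Scale the group so that the per-instance value of the custom
# metric stays near the utilization target of 10.
gcloud compute instance-groups managed set-autoscaling example-group \
    --zone us-central1-f \
    --max-num-replicas 10 \
    --custom-metric-utilization metric=custom.googleapis.com/example_queue_depth,utilization-target=10,utilization-target-type=GAUGE
```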

For more information, see Scaling Based on Stackdriver Monitoring Metrics.
