Scaling based on load balancing serving capacity

This document describes how to scale a managed instance group (MIG) based on the serving capacity of an external HTTP(S) load balancer. With this policy, the autoscaler adds or removes VM instances in the group when the load balancer indicates that the group has reached a configurable fraction of its serving capacity, where capacity is defined by the target capacity of the balancing mode that you select for the backend instance group.

You can also scale a MIG based on its CPU utilization or on Monitoring metrics.

Scaling based on HTTP(S) load balancing serving capacity

Compute Engine supports load balancing across your instance groups. You can use autoscaling together with load balancing by setting up an autoscaler that scales based on the load of your instances.

An external HTTP(S) load balancer distributes requests to backend services according to its URL map. The load balancer can have one or more backend services, each supporting instance group or network endpoint group (NEG) backends. When the backends are instance groups, the external HTTP(S) load balancer offers two balancing modes: UTILIZATION and RATE. With UTILIZATION, you specify a maximum target for the average backend utilization of the instances in the instance group. With RATE, you specify a target number of requests per second, either per instance or per group. (Only zonal instance groups support a maximum rate for the whole group; regional managed instance groups do not.)

The balancing mode and the target capacity that you specify define the conditions under which Google Cloud determines when a backend VM is at full capacity. Google Cloud attempts to send traffic to healthy VMs that have remaining capacity. If all VMs are already at capacity, the target utilization or rate is exceeded.
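
For example, a hedged sketch of attaching an instance group backend with the RATE balancing mode and a per-instance target capacity might look like the following; the backend service, instance group, and zone names are placeholders:

gcloud compute backend-services add-backend example-backend-service \
    --instance-group example-managed-instance-group \
    --instance-group-zone us-central1-f \
    --balancing-mode RATE \
    --max-rate-per-instance 100 \
    --global

With a configuration like this, Google Cloud treats an instance as at full capacity when it serves roughly 100 requests per second.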

When you attach an autoscaler to an instance group backend of an external HTTP(S) load balancer, the autoscaler scales the managed instance group to maintain a fraction of the load balancing serving capacity.

For example, assume the load balancing serving capacity of a managed instance group is defined as 100 RPS per instance. If you create an autoscaler with the HTTP(S) load balancing policy and set it to maintain a target utilization level of 0.8 or 80%, the autoscaler adds or removes instances from the managed instance group to maintain 80% of the serving capacity, or 80 RPS per instance.
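
To make the arithmetic in this example concrete, suppose the group receives 400 RPS of incoming traffic (a hypothetical figure):

serving capacity per instance:     100 RPS
target utilization:                0.8 (80 RPS per instance)
incoming traffic (hypothetical):   400 RPS
instances the autoscaler targets:  400 RPS / 80 RPS per instance = 5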

The following diagram shows how the autoscaler interacts with a managed instance group and backend service:

Figure: The relationships between the autoscaler, managed instance groups, and load balancing backend services.
The autoscaler watches the serving capacity of the managed instance group, which is defined in the backend service, and scales based on the target utilization. In this example, the serving capacity is defined by the maxRatePerInstance value.

Applicable load balancing configurations

You can set one of three options for your load balancing serving capacity. When you first create the backend, you can choose among maximum backend utilization, maximum requests per second per instance, or maximum requests per second for the whole group. Autoscaling works only with maximum backend utilization and maximum requests per second per instance, because the values of these settings can be influenced by adding or removing instances. For example, if you set a backend to handle 10 requests per second per instance and the autoscaler is configured to maintain 80% of that rate, then the autoscaler adds or removes instances to keep the group serving about 8 requests per second per instance.

Autoscaling does not work with maximum requests per group because this setting is independent of the number of instances in the instance group. The load balancer directs up to the configured maximum number of requests per second to the group as a whole, regardless of how many instances are in the group.

For example, if you set the backend to handle a maximum of 100 requests per second for the whole group, the load balancer sends 100 requests per second to the group, whether the group has two instances or 100 instances. Because adding or removing instances cannot change this value, autoscaling does not work with a load balancing configuration that uses a maximum number of requests per second per group.
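
For contrast with the per-instance example earlier, a per-group rate can be set with the --max-rate flag when adding a zonal instance group backend. The names below are placeholders, and this is the configuration that autoscaling cannot use:

gcloud compute backend-services add-backend example-backend-service \
    --instance-group example-zonal-instance-group \
    --instance-group-zone us-central1-f \
    --balancing-mode RATE \
    --max-rate 100 \
    --global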

Enable autoscaling based on load balancing serving capacity

Console


  1. Go to the Instance groups page in the Google Cloud Console.

    Go to the Instance groups page

  2. If you have an instance group, select it, and then click Edit group. If you don't have an instance group, click Create instance group.
  3. Under Autoscaling, select On.
  4. Under Autoscale based on, select HTTP load balancing utilization.
  5. Enter the Target load balancing utilization. This value is treated as a percentage. For example, for 60% HTTP load balancing utilization, enter 60.
  6. Specify the maximum number of instances that you want in this instance group. You can also set the minimum number of instances and the cool down period. The cool down period is the number of seconds that the autoscaler waits after a virtual machine has started before it begins collecting information from that VM. This accounts for the time the instance might take to initialize, during which the collected data is not reliable for autoscaling. The default cool down period is 60 seconds.
  7. Save your changes.

gcloud


To enable an autoscaler that scales based on serving capacity, use the set-autoscaling sub-command. For example, the following command creates an autoscaler that scales the target managed instance group to maintain 60% of the serving capacity. Along with the --target-load-balancing-utilization parameter, the --max-num-replicas parameter is required when creating an autoscaler:

gcloud compute instance-groups managed set-autoscaling example-managed-instance-group \
    --max-num-replicas 20 \
    --target-load-balancing-utilization 0.6 \
    --cool-down-period 90

Optionally, you can use the --cool-down-period flag, which tells the autoscaler how many seconds to wait after a new instance has started before it considers data from the new instance. This cool down period accounts for the amount of time it might take for the instance to initialize, during which the collected utilization data is not reliable for autoscaling. After the cool-down period passes, the autoscaler begins to include the instance's utilization data to determine whether the group needs to scale. The default cool down period is 60 seconds.

You can verify that your autoscaler was successfully created by using the describe sub-command:

gcloud compute instance-groups managed describe example-managed-instance-group

For a list of available gcloud commands and flags, see the gcloud reference.

API


Note: Although autoscaling is a feature of managed instance groups, it is a separate API resource. Keep that in mind when you construct API requests for autoscaling.

In the API, make a POST request to the autoscalers.insert method:

POST https://compute.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/autoscalers/

Your request body must contain the name, target, and autoscalingPolicy fields. autoscalingPolicy must define loadBalancingUtilization.

Optionally, you can use the coolDownPeriodSec field, which tells the autoscaler how many seconds to wait after a new instance has started before it considers data from the new instance. This cool down period accounts for the amount of time it might take for the instance to initialize, during which the collected utilization data is not reliable for autoscaling. After the cool-down period passes, the autoscaler begins to include the instance's utilization data to determine whether the group needs to scale. The default cool down period is 60 seconds.

{
  "name": "example-autoscaler",
  "target": "zones/us-central1-f/instanceGroupManagers/example-managed-instance-group",
  "autoscalingPolicy": {
    "maxNumReplicas": 20,
    "loadBalancingUtilization": {
      "utilizationTarget": 0.8
    },
    "coolDownPeriodSec": 90
  }
}
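
As an illustration only, you could send this request with a tool such as curl by saving the body above to a local file; the file name autoscaler.json and the [PROJECT_ID] placeholder are assumptions for this sketch:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @autoscaler.json \
    "https://compute.googleapis.com/compute/v1/projects/[PROJECT_ID]/zones/us-central1-f/autoscalers"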

For more information about enabling autoscaling based on load balancing serving capacity, complete the tutorial Globally autoscaling a web service on Compute Engine.

Scaling for other types of load balancers

To automatically scale a managed instance group that is used as a backend for other types of Google Cloud load balancers, use a different autoscaler policy.