This page describes how to use a service load balancing policy to support advanced cost, latency, and resiliency optimizations for the following load balancers:
- Global external Application Load Balancer
- Cross-region internal Application Load Balancer
- Global external proxy Network Load Balancer
- Cross-region internal proxy Network Load Balancer
Cloud Service Mesh also supports advanced load balancing optimizations. For details, see Advanced load balancing overview in the Cloud Service Mesh documentation.
A service load balancing policy (`serviceLbPolicy`) is a resource associated with the load balancer's backend service. A service load balancing policy lets you customize the parameters that influence how traffic is distributed within the backends associated with a backend service:
- Customize the load balancing algorithm used to determine how traffic is distributed within a particular region or a zone.
- Enable auto-capacity draining so that the load balancer can quickly drain traffic from unhealthy backends.
- Set a failover threshold to determine when a backend is considered unhealthy. This lets traffic fail over to a different backend to avoid unhealthy backends.
Additionally, you can designate specific backends as preferred backends. These backends must be used to capacity before requests are sent to the remaining backends.
The following diagram shows how Cloud Load Balancing evaluates routing, load balancing, and traffic distribution.
Before you begin
Before reviewing the contents of this page, carefully review the Request distribution process described on the External Application Load Balancer overview page. For load balancers that are always Premium Tier, all the load balancing algorithms described on this page support spilling over between regions if a first-choice region is already full.
Supported backends
Service load balancing policies and preferred backends can be configured only on load balancers that use the supported backends as indicated in the following table.
| Backend | Supported? |
|---|---|
| Instance groups | Yes |
| Regional MIGs | Yes |
| Zonal NEGs (`GCE_VM_IP_PORT` endpoints) | Yes |
| Hybrid NEGs (`NON_GCP_PRIVATE_IP_PORT` endpoints) | Yes |
| Serverless NEGs | No |
| Internet NEGs | No |
| Private Service Connect NEGs | No |
Load balancing algorithms
This section describes the load balancing algorithms that you can configure in a service load balancing policy. If you don't configure an algorithm, or if you don't configure a service load balancing policy at all, the load balancer uses `WATERFALL_BY_REGION` by default.
Waterfall by region
`WATERFALL_BY_REGION` is the default load balancing algorithm. With this algorithm, in aggregate, all the Google Front Ends (GFEs) in a region attempt to fill backends in proportion to their configured target capacities (modified by their capacity scalers).

Each individual second-layer GFE prefers to select backend instances or endpoints in a zone that's as close as possible (defined by network round-trip time) to the second-layer GFE. Because `WATERFALL_BY_REGION` minimizes latency between zones, at low request rates, each second-layer GFE might exclusively send requests to backends in the second-layer GFE's preferred zone.
Spray to region
The `SPRAY_TO_REGION` algorithm modifies the individual behavior of each second-layer GFE so that each second-layer GFE has no preference for selecting backend instances or endpoints in a zone as close as possible to the second-layer GFE. With `SPRAY_TO_REGION`, each second-layer GFE sends requests to all backend instances or endpoints, in all zones of the region, without preference for a shorter round-trip time between the second-layer GFE and the backend instances or endpoints.
Like `WATERFALL_BY_REGION`, in aggregate, all second-layer GFEs in the region fill backends in proportion to their configured target capacities (modified by their capacity scalers).
While `SPRAY_TO_REGION` provides more uniform distribution among backends in all zones of a region, especially at low request rates, this uniform distribution comes with the following considerations:
- When backends go down (but continue to pass their health checks), more second-layer GFEs are affected, though individual impact is less severe.
- Because each second-layer GFE has no preference for one zone over another, the second-layer GFEs create more cross-zone traffic. Depending on the number of requests being processed, each second-layer GFE might create more TCP connections to the backends as well.
Waterfall by zone
The `WATERFALL_BY_ZONE` algorithm modifies the individual behavior of each second-layer GFE so that each second-layer GFE has a very strong preference for backend instances or endpoints in the zone closest to the second-layer GFE. With `WATERFALL_BY_ZONE`, each second-layer GFE only sends requests to backend instances or endpoints in other zones of the region when the second-layer GFE has filled (or proportionally overfilled) backend instances or endpoints in its most favored zone.
Like `WATERFALL_BY_REGION`, in aggregate, all second-layer GFEs in the region fill backends in proportion to their configured target capacities (modified by their capacity scalers).
The `WATERFALL_BY_ZONE` algorithm minimizes latency with the following considerations:

- `WATERFALL_BY_ZONE` does not inherently minimize cross-zone connections. The algorithm is steered by latency only.
- `WATERFALL_BY_ZONE` does not guarantee that each second-layer GFE always fills its most favored zone before filling other zones. Maintenance events can temporarily cause all traffic from a second-layer GFE to be sent to backend instances or endpoints in another zone.
- `WATERFALL_BY_ZONE` can result in less uniform distribution of requests among all backend instances or endpoints within the region as a whole. For example, backend instances or endpoints in the second-layer GFE's most favored zone might be filled to capacity while backends in other zones are not filled to capacity.
Compare load balancing algorithms
The following table compares the different load balancing algorithms.
Behavior | Waterfall by region | Spray to region | Waterfall by zone |
---|---|---|---|
Uniform capacity usage within a single region | Yes | Yes | No |
Uniform capacity usage across multiple regions | No | No | No |
Uniform traffic split from load balancer | No | Yes | No |
Cross-zone traffic distribution | Yes. Traffic is distributed evenly across zones in a region while optimizing network latency. Traffic might be sent across zones if needed. | Yes | Yes. Traffic first goes to the nearest zone until it is at capacity. Then, it goes to the next closest zone. |
Sensitivity to traffic spikes in a local zone | Average; depends on how much traffic has already been shifted to balance across zones. | Lower; single zone spikes are spread across all zones in the region. | Higher; single zone spikes are likely to be served entirely by the same zone until the load balancer is able to react. |
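The differences in the table can be made concrete with a small simulation. The following Python sketch is a simplified model of how a single second-layer GFE might pick a zone for one request under each algorithm. The `pick_zone` function, the `zones` data shape, and the spill heuristic used for `WATERFALL_BY_REGION` are illustrative assumptions for this sketch, not the actual GFE implementation.

```python
import random

def pick_zone(algorithm, zones, local_zone):
    """Simplified model of one second-layer GFE's zone choice.

    zones: {zone_name: {"capacity": int, "load": int}}; a zone with
    load < capacity has spare room. Returns a zone name or None if
    every zone is full.
    """
    spare = {z: v["capacity"] - v["load"]
             for z, v in zones.items() if v["capacity"] > v["load"]}
    if not spare:
        return None

    if algorithm == "SPRAY_TO_REGION":
        # No zone preference: weight all zones by spare capacity.
        return random.choices(list(spare), weights=list(spare.values()))[0]

    if algorithm == "WATERFALL_BY_ZONE":
        # Very strong preference for the GFE's closest zone until it fills.
        if local_zone in spare:
            return local_zone
        return random.choices(list(spare), weights=list(spare.values()))[0]

    # WATERFALL_BY_REGION (default): prefer the local zone at low load,
    # but spill once the local zone is noticeably busier than the region
    # as a whole (a crude stand-in for proportional region-wide filling).
    region_fill = (sum(v["load"] for v in zones.values())
                   / sum(v["capacity"] for v in zones.values()))
    local = zones.get(local_zone)
    if (local and local["capacity"] > local["load"]
            and local["load"] / local["capacity"] <= region_fill + 0.1):
        return local_zone
    return random.choices(list(spare), weights=list(spare.values()))[0]
```

For example, with two empty zones, both waterfall algorithms keep traffic in the GFE's local zone, while `SPRAY_TO_REGION` may pick either zone.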
Auto-capacity draining
When a backend is unhealthy, you usually want to exclude it from load balancing decisions as fast as possible. Excluding unhealthy backends optimizes overall latency by sending traffic only to healthy backends.
When you enable the auto-capacity draining feature, the load balancer
automatically scales a backend's capacity to zero when less than 25 percent
of the backend's instances or endpoints are passing health checks. This
removes the unhealthy backend from the global load balancing pool.
This action is functionally equivalent to setting `backendService.capacityScaler` to `0` for a backend when you want to avoid routing traffic to that backend.
If at least 35 percent (10 percentage points above the threshold) of a previously auto-drained backend's instances or endpoints pass health checks for 60 seconds, the backend is automatically undrained and added back to the load balancing pool. This ensures that the backend is truly healthy and doesn't flip-flop between drained and undrained states.
Even with auto-capacity draining enabled, the load balancer doesn't drain more than 50 percent of backends attached to a backend service, regardless of a backend's health status. Keeping 50 percent of backends attached reduces the risk of overloading healthy backends.
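These thresholds amount to a small state machine per backend. The following Python sketch models the rules described above (drain below 25 percent healthy, undrain after 35 percent healthy is sustained for 60 seconds, never drain more than 50 percent of backends). The data structures and function are illustrative assumptions, not Google Cloud's implementation.

```python
DRAIN_BELOW = 0.25        # drain when healthy fraction drops below 25%
UNDRAIN_AT = 0.35         # undrain at 35% healthy (10 points above threshold)...
UNDRAIN_HOLD_S = 60       # ...sustained for 60 seconds
MAX_DRAIN_FRACTION = 0.5  # never drain more than 50% of backends

def update_drain_state(backends, now):
    """Update drain state for each backend in a backend service.

    backends: list of dicts with keys "healthy_fraction" (0.0-1.0),
    "drained" (bool), and "healthy_since" (timestamp or None).
    now: current time in seconds.
    """
    for b in backends:
        if not b["drained"]:
            if b["healthy_fraction"] < DRAIN_BELOW:
                # Respect the 50% cap across the whole backend service.
                drained_count = sum(x["drained"] for x in backends)
                if (drained_count + 1) / len(backends) <= MAX_DRAIN_FRACTION:
                    b["drained"] = True
                    b["healthy_since"] = None
        else:
            if b["healthy_fraction"] >= UNDRAIN_AT:
                if b["healthy_since"] is None:
                    b["healthy_since"] = now        # start the 60-second clock
                elif now - b["healthy_since"] >= UNDRAIN_HOLD_S:
                    b["drained"] = False            # sustained recovery: undrain
            else:
                b["healthy_since"] = None           # health dipped: reset clock
    return backends
```

Note how a backend that recovers to 35 percent healthy is not undrained until that level has held for the full 60 seconds.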
One use case for auto-capacity draining is minimizing the risk of overloading your preferred backends. For example, if a backend is marked preferred but most of its instances or endpoints are unhealthy, auto-capacity draining removes the backend from the load balancing pool. Instead of overloading the remaining healthy instances or endpoints in the preferred backend, auto-capacity draining shifts traffic to other backends.
You can enable auto-capacity draining as part of the service load balancing policy. For details, see Configure a service load balancing policy.
Auto-capacity draining is not supported with backends that don't use a balancing mode, such as internet NEGs, serverless NEGs, and Private Service Connect NEGs.
Failover threshold
The load balancer determines the distribution of traffic among backends in a multi-level fashion. In the steady state, it sends traffic to backends that are selected based on one of the previously described load balancing algorithms. These backends, called primary backends, are considered optimal in terms of latency and capacity.
The load balancer also keeps track of other backends that can be used if the primary backends become unhealthy and are unable to handle traffic. These backends are called failover backends. These backends are typically nearby backends with remaining capacity.
If instances or endpoints in the primary backend become unhealthy, the load balancer doesn't shift traffic to other backends immediately. Instead, the load balancer first shifts traffic to other healthy instances or endpoints in the same backend to help stabilize traffic load. If too many endpoints in a primary backend are unhealthy, and the remaining endpoints in the same backend are not able to handle the extra traffic, the load balancer uses the failover threshold to determine when to start sending traffic to a failover backend. The load balancer tolerates unhealthiness in the primary backend up to the failover threshold. After that, traffic is shifted away from the primary backend.
The failover threshold is a value between 1 and 99, expressed as a percentage of endpoints in a backend that must be healthy. If the percentage of healthy endpoints falls below the failover threshold, the load balancer tries to send traffic to a failover backend. By default, the failover threshold is 70.
If the failover threshold is set too high, unnecessary traffic spills can occur due to transient health changes. If the failover threshold is set too low, the load balancer continues to send traffic to the primary backends even though there are a lot of unhealthy endpoints.
Failover decisions are localized. Each local Google Front End (GFE) behaves independently of the others. It is your responsibility to make sure that your failover backends can handle the additional traffic.
Failover traffic can result in overloaded backends. Even if a backend is unhealthy, the load balancer might still send traffic there. To exclude unhealthy backends from the pool of available backends, enable the auto-capacity drain feature.
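The threshold check itself reduces to a single comparison. The following Python sketch models the decision described above; the function name and inputs are illustrative, and this is a simplified view of the decision each GFE makes independently.

```python
def should_fail_over(healthy_endpoints, total_endpoints, failover_threshold=70):
    """Return True when the primary backend's healthy percentage falls
    below the failover threshold (1-99, default 70), meaning the load
    balancer starts shifting traffic to a failover backend."""
    if total_endpoints == 0:
        return True  # no endpoints at all: nothing can serve traffic
    healthy_pct = 100 * healthy_endpoints / total_endpoints
    return healthy_pct < failover_threshold
```

With the default threshold of 70, a backend with 6 of 10 healthy endpoints (60 percent) triggers failover, while 8 of 10 (80 percent) does not.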
Preferred backends
Preferred backends are backends whose capacity you want to completely use before spilling traffic over to other backends. Any traffic over the configured capacity of preferred backends is routed to the remaining non-preferred backends. The load balancing algorithm then distributes traffic between the non-preferred backends of a backend service.
You can configure your load balancer to prefer and completely use one or more backends attached to a backend service before routing subsequent requests to the remaining backends.
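The spillover behavior can be sketched as follows: fill `PREFERRED` backends to capacity, then distribute the remainder among `DEFAULT` backends. In this simplified Python model the remainder is split proportionally to capacity; in practice, the configured load balancing algorithm distributes traffic among the non-preferred backends. The `route` function and the data shape are illustrative assumptions.

```python
def route(requests, backends):
    """Distribute requests, filling PREFERRED backends to capacity first.

    backends: list of dicts with keys "name", "preference"
    ("PREFERRED" or "DEFAULT"), and "capacity" (requests it can absorb).
    Returns {name: assigned_requests}.
    """
    assigned = {b["name"]: 0 for b in backends}
    remaining = requests

    # Preferred backends are used to capacity before anything spills over.
    for b in (x for x in backends if x["preference"] == "PREFERRED"):
        take = min(remaining, b["capacity"])
        assigned[b["name"]] = take
        remaining -= take

    # Spill the remainder to DEFAULT backends, proportionally to capacity.
    defaults = [b for b in backends if b["preference"] == "DEFAULT"]
    total_cap = sum(b["capacity"] for b in defaults)
    for b in defaults:
        assigned[b["name"]] = (remaining * b["capacity"] // total_cap
                               if total_cap else 0)
    return assigned
```

For example, with one preferred backend of capacity 100 and two default backends of capacity 50 each, 150 requests fill the preferred backend completely and split the remaining 50 between the defaults.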
Consider the following limitations when you use preferred backends:
- The backends configured as preferred backends might be further away from the clients and result in higher average latency for client requests. This happens even if there are other closer backends which could have served the clients with lower latency.
- Certain load balancing algorithms (`WATERFALL_BY_REGION`, `SPRAY_TO_REGION`, and `WATERFALL_BY_ZONE`) don't apply to backends configured as preferred backends.
To learn how to set preferred backends, see Set preferred backends.
Configure a service load balancing policy
The service load balancing policy resource lets you configure the following fields:
- Load balancing algorithm
- Auto-capacity draining
- Failover threshold
To set a preferred backend, see Set preferred backends.
Create a policy
To create and configure a service load balancing policy, complete the following steps:
Create a service load balancing policy resource. You can do this either by using a YAML file or directly, by using `gcloud` parameters.

**With a YAML file.** You specify service load balancing policies in a YAML file. Here is a sample YAML file that shows you how to configure a load balancing algorithm, enable auto-capacity draining, and set a custom failover threshold:

```
name: projects/PROJECT_ID/locations/global/serviceLbPolicies/SERVICE_LB_POLICY_NAME
autoCapacityDrain:
  enable: True
failoverConfig:
  failoverHealthThreshold: FAILOVER_THRESHOLD_VALUE
loadBalancingAlgorithm: LOAD_BALANCING_ALGORITHM
```
Replace the following:
- PROJECT_ID: the project ID.
- SERVICE_LB_POLICY_NAME: the name of the service load balancing policy.
- FAILOVER_THRESHOLD_VALUE: the failover threshold value. This should be a number between 1 and 99.
- LOAD_BALANCING_ALGORITHM: the load balancing algorithm to use. This can be `SPRAY_TO_REGION`, `WATERFALL_BY_REGION`, or `WATERFALL_BY_ZONE`.
After you create the YAML file, import the file to a new service load balancing policy.
```
gcloud network-services service-lb-policies import SERVICE_LB_POLICY_NAME \
    --source=PATH_TO_POLICY_FILE \
    --location=global
```
**Without a YAML file.** Alternatively, you can configure service load balancing policy features without using a YAML file.
To set the load balancing algorithm, enable auto-capacity draining, and set the failover threshold, use the following command:

```
gcloud network-services service-lb-policies create SERVICE_LB_POLICY_NAME \
    --load-balancing-algorithm=LOAD_BALANCING_ALGORITHM \
    --auto-capacity-drain \
    --failover-health-threshold=FAILOVER_THRESHOLD_VALUE \
    --location=global
```
Replace the following:
- SERVICE_LB_POLICY_NAME: the name of the service load balancing policy.
- LOAD_BALANCING_ALGORITHM: the load balancing algorithm to use. This can be `SPRAY_TO_REGION`, `WATERFALL_BY_REGION`, or `WATERFALL_BY_ZONE`.
- FAILOVER_THRESHOLD_VALUE: the failover threshold value. This should be a number between 1 and 99.
Update a backend service so that its `--service-lb-policy` field references the newly created service load balancing policy resource. A backend service can only be associated with one service load balancing policy resource.

```
gcloud compute backend-services update BACKEND_SERVICE_NAME \
    --service-lb-policy=SERVICE_LB_POLICY_NAME \
    --global
```
You can associate a service load balancing policy with a backend service while creating the backend service.
```
gcloud compute backend-services create BACKEND_SERVICE_NAME \
    --protocol=PROTOCOL \
    --port-name=NAMED_PORT_NAME \
    --health-checks=HEALTH_CHECK_NAME \
    --load-balancing-scheme=LOAD_BALANCING_SCHEME \
    --service-lb-policy=SERVICE_LB_POLICY_NAME \
    --global
```
Remove a policy
To remove a service load balancing policy from a backend service, use the following command:
```
gcloud compute backend-services update BACKEND_SERVICE_NAME \
    --no-service-lb-policy \
    --global
```
Set preferred backends
You can configure preferred backends by using either the Google Cloud CLI or the API.
gcloud
Add a preferred backend
To set a preferred backend, use the `gcloud compute backend-services add-backend` command with the `--preference` flag when you're adding the backend to the backend service.
```
gcloud compute backend-services add-backend BACKEND_SERVICE_NAME \
    ... \
    --preference=PREFERENCE \
    --global
```
Replace PREFERENCE with the level of preference you want to assign to the backend. This can be either `PREFERRED` or `DEFAULT`.
The rest of the command depends on the type of backend you're using (instance group or NEG). For all the required parameters, see the `gcloud compute backend-services add-backend` command.
Update a backend's preference
To update a backend's `--preference` parameter, use the `gcloud compute backend-services update-backend` command.
```
gcloud compute backend-services update-backend BACKEND_SERVICE_NAME \
    ... \
    --preference=PREFERENCE \
    --global
```
The rest of the command depends on the type of backend you're using (instance group or NEG). The following example command updates a backend instance group's preference and sets it to `PREFERRED`:
```
gcloud compute backend-services update-backend BACKEND_SERVICE_NAME \
    --instance-group=INSTANCE_GROUP_NAME \
    --instance-group-zone=INSTANCE_GROUP_ZONE \
    --preference=PREFERRED \
    --global
```
API
To set a preferred backend, set the `preference` field on each backend by using the global `backendServices` resource.

Here is a sample that shows you how to configure the backend preference:

```
name: projects/PROJECT_ID/locations/global/backendServices/BACKEND_SERVICE_NAME
...
backends:
  - name: BACKEND_1_NAME
    preference: PREFERRED
    ...
  - name: BACKEND_2_NAME
    preference: DEFAULT
    ...
```
Replace the following:
- PROJECT_ID: the project ID
- BACKEND_SERVICE_NAME: the name of the backend service
- BACKEND_1_NAME: the name of the preferred backend
- BACKEND_2_NAME: the name of the default backend
Troubleshooting
Traffic distribution patterns can change when you attach a new service load balancing policy to a backend service.
To debug traffic issues, use Cloud Monitoring to look at how traffic flows between the load balancer and the backend. Cloud Load Balancing logs and metrics can also help you understand load balancing behavior.
This section summarizes a few common scenarios that you might see in the newly exposed configuration.
Traffic from a single source is sent to too many distinct backends
This is the intended behavior of the `SPRAY_TO_REGION` algorithm. However, you might experience issues caused by wider distribution of your traffic. For example, cache hit rates might decrease because backends see traffic from a wider selection of clients. In this case, consider using other algorithms like `WATERFALL_BY_REGION`.
Traffic is not being sent to backends with lots of unhealthy endpoints
This is the intended behavior when `autoCapacityDrain` is enabled. Backends with a lot of unhealthy endpoints are drained and removed from the load balancing pool. If you don't want this behavior, you can disable auto-capacity draining. However, this means that traffic can be sent to backends with a lot of unhealthy endpoints, and requests can fail.
Traffic is being sent to more distant backends before closer ones
This is the intended behavior if your preferred backends are further away than your default backends. If you don't want this behavior, update the preference settings for each backend accordingly.
Traffic is not being sent to some backends when using preferred backends
This is the intended behavior when your preferred backends have not yet reached capacity. The preferred backends are assigned first based on round-trip time latency to these backends.
If you want traffic sent to other backends, you can do one of the following:
- Update preference settings for the other backends.
- Set a lower target capacity for your preferred backends. The target capacity is configured by using either the `max-rate` or the `max-utilization` field, depending on the backend service's balancing mode.
Traffic is being sent to a remote backend during transient health changes
This is the intended behavior when the failover threshold is set to a high value. If you want traffic to keep going to the primary backends when there are transient health changes, set this field to a lower value.
Healthy endpoints are overloaded when other endpoints are unhealthy
This is the intended behavior when the failover threshold is set to a low value. When endpoints are unhealthy, the traffic intended for these unhealthy endpoints is instead spread among the remaining endpoints in the same backend. If you want the failover behavior to be triggered sooner, set this field to a higher value.
Limitations
- Each backend service can only be associated with a single service load balancing policy resource.