Backend Services

An HTTP(S) load balancing backend service is a centralized service for managing backends, which in turn manage the instances that handle user requests. You configure your load balancing service to route requests to your backend service. The backend service knows which instances it can use, how much traffic they can handle, and how much traffic they are currently handling. It also uses health checks to monitor instances and does not send traffic to unhealthy ones.

Backend service components

A configured backend service contains one or more backends. Each backend service contains the following:

  • A health check. The health checker polls instances attached to the backend service at configured intervals. Instances that pass the health check are allowed to receive new requests. Unhealthy instances are not sent requests until they are healthy again.
  • Session affinity (optional). Normally, HTTP(S) load balancing uses a round-robin algorithm to distribute requests among available instances. Session affinity overrides this behavior: it attempts to send all requests from the same client to the same virtual machine instance.
  • A timeout setting. This is the amount of time, 30 seconds by default, that the backend service waits for a backend to respond before it considers the request a failure. It is a fixed timeout, not an idle timeout. If you require longer-lived connections, set this value accordingly using the API, SDK, or Cloud Platform Console.
  • One or more backends. A backend contains the following:
    • An Instance Group containing virtual machine instances. The instance group may be a Managed Instance Group (with or without Autoscaling) or an Unmanaged Instance Group. A backend cannot be added to a backend service if it doesn't contain an instance group.
    • A balancing mode, which tells the load balancing system how to determine when the backend is at full usage. If all the backends for the backend service in a region are at full usage, new requests are automatically routed to the nearest region that can still handle requests. The balancing mode can be based on CPU utilization or on rate (requests per second, RPS).
    • A capacity setting. Capacity is an additional control that interacts with the balancing mode setting. For example, if you normally want your instances to operate at a maximum of 80% CPU utilization, you would set your balancing mode to 80% CPU utilization and your capacity to 100%. If you want to cut instance utilization in half, you could leave the balancing mode at 80% CPU utilization and set capacity to 50%, which effectively caps the backend at 40% CPU utilization. To drain the backend, set capacity to 0% and leave the balancing mode as is.
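
One way to reason about how capacity interacts with the balancing mode is to treat capacity as a multiplier on the balancing-mode target. The shell function below is a hypothetical sketch of that arithmetic only, using integer percentages; the function name is made up, it does not call any Google Cloud API, and it is not how the load balancer is implemented internally.

```shell
# Hypothetical helper: treat capacity as a multiplier on the balancing-mode target.
# Both arguments are integer percentages.
effective_target() {
  local mode_pct=$1 capacity_pct=$2
  echo $(( mode_pct * capacity_pct / 100 ))
}

effective_target 80 100   # normal operation: 80
effective_target 80 50    # utilization cut in half: 40
effective_target 80 0     # backend drained: 0
```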

A backend service can have at most 500 endpoints (IP address and port pairs) in a given zone.

See the backend service API resource or the gcloud command-line tool user guide for descriptions of the properties that are available when working with backend services.

Backend services and regions

HTTP(S) load balancing is a global service. You may have more than one backend service in a region, and you may assign backend services to more than one region, all serviced by the same global load balancer. Traffic is allocated to backend services as follows:

  1. When a user request comes in, the load balancing service determines the approximate origin of the request from the source IP address.
  2. The load balancing service knows the locations of the instances owned by the backend service, their overall capacity, and their overall current usage.
  3. If the closest instances to the user have available capacity, then the request is forwarded to that closest set of instances.
  4. Incoming requests to the given region are distributed evenly across all available backend services and instances in that region. However, at very small loads, the distribution may appear to be uneven.
  5. If there are no healthy instances with available capacity in a given region, the load balancer instead sends the request to the next closest region with available capacity.
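
The nearest-first fallback in steps 3 and 5 can be sketched as a simple loop. This is a hypothetical model under simplifying assumptions: the regions are already sorted by distance from the user, and "available capacity" is reduced to a current and a maximum load per region. The real load balancer also accounts for instance health and spreads load within a region as described in step 4.

```shell
# Hypothetical model of steps 3 and 5: walk regions nearest-first and print the
# first one with spare capacity. Each argument is "region:current_load:max_load"
# (the load unit is up to you, for example RPS).
pick_region() {
  local entry region current max
  for entry in "$@"; do
    IFS=: read -r region current max <<< "$entry"
    if (( current < max )); then
      echo "$region"
      return 0
    fi
  done
  return 1  # no region has capacity left
}

# us-central1 is full, so the request spills over to the next closest region.
pick_region us-central1:500:500 europe-west1:120:400 asia-east1:0:300
```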

Session affinity

By default, HTTP(S) load balancing distributes requests evenly among available instances. However, some applications, such as stateful servers used by ads serving, games or services with heavy internal caching, need multiple requests from a given user to end up on the same instance. Session affinity makes this possible, identifying requests from a user by the client IP or the value of a cookie and directing such requests to a consistent instance as long as that instance is healthy and has capacity. Affinity can break if the instance becomes unhealthy or overloaded, so your system must not assume perfect affinity.

Setting session affinity

HTTP(S) load balancing offers the following types of session affinity: client IP affinity and generated cookie affinity.

Client IP affinity

Client IP affinity directs requests from the same client IP address to the same instance based on a hash of the IP address. This is simple and does not involve a user cookie. However, because of NATs, CDNs, and other internet routing technologies, sometimes requests from multiple independent users can look as if they come from the same client, causing many users to clump unnecessarily onto the same instances. In addition, clients who move from one network to another may change IP address, thus losing affinity.
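
The sketch below illustrates the idea of hash-based affinity and why NAT causes clumping. It uses `cksum` as a stand-in hash and a made-up function name; the load balancer's actual hash function and instance selection are not described on this page, so treat this purely as an illustration.

```shell
# Illustrative only: map a client IP to one of N instances with a stable hash.
# cksum is a stand-in; it is not the hash the load balancer actually uses.
instance_for_ip() {
  local ip=$1 instance_count=$2 crc
  crc=$(printf '%s' "$ip" | cksum | cut -d' ' -f1)
  echo $(( crc % instance_count ))
}

# The same IP always maps to the same instance index...
instance_for_ip 203.0.113.7 4
instance_for_ip 203.0.113.7 4
# ...so every user behind one NAT address also shares one instance.
```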


  1. In the Google Cloud Platform Console, go to the Load balancing page. Session affinity is set in the Backend configuration portion of the HTTP(S) load balancer page.
  2. Select Edit for your load balancer.
  3. Select Backend configuration.
  4. Select Client IP from the Session affinity pull-down menu to turn on client IP session affinity. Ignore the Affinity cookie TTL field, which has no effect with client IP affinity.


You can use the create command to set session affinity for a new backend service, or the update command to set it for an existing backend service. This example uses the update command.

gcloud compute backend-services update [BACKEND_SERVICE_NAME] \
    --session-affinity client_ip

Generated cookie affinity

With generated cookie affinity, the load balancer issues a cookie named GCLB on the first request and then directs each subsequent request that has the same cookie to the same instance. Because clients are identified by cookie rather than by address, the load balancer can distinguish different clients that share an IP address and spread them across instances more evenly. It can also maintain instance affinity when a client’s IP address changes.

The path of the cookie is always /, so if different backend services on the same hostname both enable generated cookie affinity, they share the same cookie.

The lifetime of the HTTP cookie generated by the load balancer is configurable. It can be set to 0 (default), which means the cookie is only a session cookie, or it can have a lifetime of 1 to 86400 seconds (24 hours).


  1. In the Google Cloud Platform Console, go to the Load balancing page. Session affinity is set in the Backend configuration portion of the HTTP(S) load balancer page.
  2. Select Edit for your load balancer.
  3. Select Backend configuration.
  4. Select Generated cookie from the Session affinity pull-down menu to turn on generated cookie affinity. In the Affinity cookie TTL field, set the cookie's lifetime in seconds.


Turn on generated cookie affinity by setting --session-affinity to generated_cookie and --affinity-cookie-ttl to the cookie lifetime in seconds. You can use the create command to set it for a new backend service, or the update command to set it for an existing backend service. This example uses the update command.

gcloud compute backend-services update [BACKEND_SERVICE_NAME] \
    --session-affinity generated_cookie \
    --affinity-cookie-ttl 86400
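
If you script this command, you may want to check the TTL against the documented range (0 for a session cookie, or 1 to 86400 seconds) before calling gcloud. The wrapper below is a hypothetical sketch: the function name is made up, and it uses only the gcloud flags shown above.

```shell
# Hypothetical wrapper: validate the cookie TTL before calling gcloud.
# 0 means a session cookie; otherwise the TTL must be 1 to 86400 seconds.
set_generated_cookie_affinity() {
  local backend_service=$1 ttl=$2
  # Reject non-numbers and leading zeros (bash would read "010" as octal).
  if ! [[ $ttl =~ ^(0|[1-9][0-9]*)$ ]] || (( ttl > 86400 )); then
    echo "affinity cookie TTL must be 0-86400 seconds, got: $ttl" >&2
    return 1
  fi
  gcloud compute backend-services update "$backend_service" \
    --session-affinity generated_cookie \
    --affinity-cookie-ttl "$ttl"
}
```

For example, `set_generated_cookie_affinity web-backend-service 3600`, where `web-backend-service` is a placeholder name.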

Disabling session affinity

You can turn off session affinity with the update command by setting session affinity to none, or with the edit command, which opens the backend service in a text editor where you can set session affinity to none. You can also use either command to change the cookie lifetime.


  1. In the Google Cloud Platform Console, go to the Load balancing page. Session affinity is set in the Backend configuration portion of the HTTP(S) load balancer page.
  2. Select Edit for your load balancer.
  3. Select Backend configuration.
  4. Select None from the Session affinity pull-down menu to turn off session affinity.


gcloud compute backend-services update [BACKEND_SERVICE_NAME] \
    --session-affinity none

Alternatively, open the backend service in a text editor:

gcloud compute backend-services edit [BACKEND_SERVICE_NAME]

Losing session affinity

Regardless of the type of affinity chosen, a client can lose affinity with the instance in the following scenarios.

  • The instance group runs out of capacity, and traffic has to be routed to a different zone. In this case, traffic from existing sessions may be sent to the new zone, breaking affinity. You can mitigate this by ensuring that your instance groups have enough capacity to handle all local users.
  • Autoscaling adds instances to, or removes instances from, the instance group. In either case, the backend service reallocates load, and the target may move. You can mitigate this by ensuring that the minimum number of instances provisioned by autoscaling is enough to handle expected load, then only using autoscaling for unexpected increases in load.
  • The target instance fails health checks. Affinity is lost as the session is moved to a healthy instance.

Backend services and autoscaled managed instance groups

Autoscaled managed instance groups are useful if you need many machines all configured the same way, and you want to automatically add or remove instances based on need. First, create an instance template for your instance group. Then, create a managed instance group and assign the template to it. Lastly, turn on autoscaling for the managed instance group and select HTTP load balancing usage under Autoscale based on, then set a Target load balancing usage percentage.
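
The three steps above might look like the following with the gcloud tool. This is a hypothetical sketch: the function name, machine type, group size, and replica limits are placeholder assumptions, and a real instance template usually needs more options (image, network, startup script). The --target-load-balancing-utilization flag corresponds to the Target load balancing usage setting, expressed as a fraction.

```shell
# Hypothetical helper covering the three steps above. Names, zone, machine type,
# and replica limits are placeholders; adjust them for your project.
create_autoscaled_group() {
  local template=$1 group=$2 zone=$3
  # 1. Instance template that every instance in the group is created from.
  gcloud compute instance-templates create "$template" \
    --machine-type=n1-standard-1
  # 2. Managed instance group built from the template.
  gcloud compute instance-groups managed create "$group" \
    --zone="$zone" --template="$template" --size=2
  # 3. Autoscale on HTTP load balancing usage, targeting 80% (0.8).
  gcloud compute instance-groups managed set-autoscaling "$group" \
    --zone="$zone" \
    --min-num-replicas=2 --max-num-replicas=10 \
    --target-load-balancing-utilization=0.8
}
```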

The autoscaling percentage works together with the backend service balancing mode. For example, suppose you set the balancing mode to 80% CPU utilization, leave the capacity scaler at 100%, and set the Target load balancing usage in the autoscaler to 80%. Whenever the group's CPU utilization rises above 64% (80% of 80%), the autoscaler creates new instances from the template until usage drops to about 64%. If overall usage drops below 64%, the autoscaler removes instances until usage returns to 64%.
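
The arithmetic in the example can be written out as a small decision function. This is a deliberately simplified, hypothetical model (made-up function name, integer percentages); the real autoscaler's behavior is more involved, for example the cooldown period described in the next paragraph.

```shell
# Simplified, hypothetical model of the example above; not the autoscaler's
# real algorithm. All arguments are integer percentages.
autoscaler_action() {
  local mode_pct=$1 capacity_pct=$2 target_pct=$3 current_cpu_pct=$4
  # 80% mode x 100% capacity x 80% target = 64% trigger point.
  local trigger=$(( mode_pct * capacity_pct * target_pct / 10000 ))
  if (( current_cpu_pct > trigger )); then
    echo "add instances (trigger ${trigger}%)"
  elif (( current_cpu_pct < trigger )); then
    echo "remove instances (trigger ${trigger}%)"
  else
    echo "hold (trigger ${trigger}%)"
  fi
}

autoscaler_action 80 100 80 70   # add instances (trigger 64%)
autoscaler_action 80 100 80 50   # remove instances (trigger 64%)
```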

New instances have a cooldown period before they are considered part of the group, so it's possible for traffic to exceed the backend service's 80% CPU utilization during that time, causing excess traffic to be routed to the next available backend service. Once the instances are available, new traffic will be routed to them. Also, if the number of instances reaches the maximum permitted by the autoscaler's settings, the autoscaler will stop adding instances no matter what the usage is. In this case, extra traffic will be load balanced to the next available region.

Restrictions and guidance

Because Compute Engine offers a great deal of flexibility in how you configure load balancing, it is possible to create configurations that do not behave well. Please keep the following restrictions and guidance in mind when creating instance groups for use with load balancing.

  • Do not put a virtual machine instance in more than one instance group.
  • Do not delete an instance group if it is being used by a backend.
  • Your configuration will be simpler if you do not add the same instance group to two different backends. If you do add the same instance group to two backends:
    • Both backends must use the same balancing mode, either UTILIZATION or RATE.
    • You can use maxRatePerInstance and maxRatePerGroup together. It is acceptable to set one backend to use maxRatePerInstance and the other to maxRatePerGroup.
    • If your instance group serves a different port for each of several backends, define a separate port name for each port in the instance group.
  • All instances in a managed or unmanaged instance group must be in the same VPC network and, if applicable, the same subnet.
  • If you are using a managed instance group with autoscaling, do not use the maxRate balancing mode in the backend service. You may use either the maxUtilization or maxRatePerInstance mode.
  • Do not make an autoscaled managed instance group the target of two different load balancers.
  • When resizing a managed instance group, keep the group's maximum size smaller than or equal to the size of the subnet.

Creating and modifying a backend service

To create or modify a backend service using the gcloud command-line tool, see the Cloud SDK documentation.

To create or modify a backend service with the API, see the API docs.

Backend services need a health check, so create the health check before you create the backend service. We recommend using a health check that uses the same protocol as the traffic you are load balancing.
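
A minimal sketch of that order with the gcloud tool, assuming HTTP traffic on port 80 and a named port called http; the function name and all resource names are hypothetical. The --global flag is an assumption based on current gcloud releases, which generally expect you to choose between global and regional backend services; the commands elsewhere on this page omit it.

```shell
# Hypothetical helper; health check name, backend service name, port, and
# request path are placeholders.
create_http_backend_service() {
  local health_check=$1 backend_service=$2
  # Create the health check first, using the same protocol (HTTP) as the traffic.
  gcloud compute health-checks create http "$health_check" \
    --port=80 --request-path=/
  # Then create the backend service that uses it. --port-name must match a named
  # port on the instance groups; --timeout=30s spells out the default timeout.
  gcloud compute backend-services create "$backend_service" \
    --global --protocol=HTTP --port-name=http \
    --health-checks="$health_check" --timeout=30s
}
```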

Add instance groups to a backend service

To define the instances that are included in a backend service, you must add a backend and assign an instance group to it. You must create the instance group before you add it to the backend.

A backend service sends traffic to its backends through a named port.

Named ports are key-value metadata representing the service name and the ports that the service is running on. The port name is mapped to one or more port numbers in each instance group. Named ports can be assigned to an instance group, which indicates that the service is available on all instances in the group. This information is used by the HTTP(S) Load Balancing service and TCP/SSL proxy.

Only one port name can be added to a backend service, and that name must be defined as a named port on every instance group in the backend service. Each instance in an instance group has the same set of named ports.
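
As a sketch, the following maps the port name http to port 80 on one instance group so that a backend service using that port name can reach it. The function name, group name, and zone are hypothetical. To our understanding, set-named-ports replaces the group's whole list of named ports, so if a group serves several ports, pass all of them in one comma-separated --named-ports value.

```shell
# Hypothetical helper: define the named port "http" as port 80 on an instance group.
set_http_named_port() {
  local instance_group=$1 zone=$2
  gcloud compute instance-groups set-named-ports "$instance_group" \
    --named-ports=http:80 --zone="$zone"
}
```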

When you add the instance group to the backend, you must also set backend parameters such as the balancing mode and capacity. See the Cloud SDK documentation or the API docs for the full list.
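
For example, the hypothetical helper below attaches an instance group using the UTILIZATION balancing mode at 80% CPU with capacity at 100%, matching the example in Backend service components. The names, zone, and values are placeholders, and --global is the same assumption as for creating the backend service. --max-utilization and --capacity-scaler take fractions, so 0.8 means 80% and 1.0 means 100%.

```shell
# Hypothetical helper: attach an instance group as a backend with the
# UTILIZATION balancing mode (80% CPU) and capacity at 100%.
add_instance_group_backend() {
  local backend_service=$1 instance_group=$2 zone=$3
  gcloud compute backend-services add-backend "$backend_service" \
    --global \
    --instance-group="$instance_group" \
    --instance-group-zone="$zone" \
    --balancing-mode=UTILIZATION \
    --max-utilization=0.8 \
    --capacity-scaler=1.0
}
```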

Health checking

Each backend service must have a health check associated with it. The health check runs continuously, and its results determine which instances receive new requests. Unhealthy instances do not receive new requests, but they continue to be polled. If an unhealthy instance passes a health check, it is deemed healthy and begins receiving new connections.
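
The polling behavior can be modeled as a small state machine. This is a hypothetical sketch: the function name is made up, instances are assumed to start UNHEALTHY until they pass a check, and the two thresholds (consecutive passes or failures needed to change state) are assumptions comparable to the --healthy-threshold and --unhealthy-threshold options of gcloud health checks. With both thresholds set to 1, it matches the single-pass behavior described above.

```shell
# Hypothetical model: replay probe results ("pass" or "fail") and print the
# final state. Thresholds are consecutive results needed to change state.
health_state() {
  local healthy_threshold=$1 unhealthy_threshold=$2
  shift 2
  local state=UNHEALTHY passes=0 fails=0 result
  for result in "$@"; do
    if [[ $result == pass ]]; then
      passes=$(( passes + 1 )); fails=0
      if [[ $state == UNHEALTHY ]] && (( passes >= healthy_threshold )); then
        state=HEALTHY
      fi
    else
      fails=$(( fails + 1 )); passes=0
      if [[ $state == HEALTHY ]] && (( fails >= unhealthy_threshold )); then
        state=UNHEALTHY
      fi
    fi
  done
  echo "$state"
}

health_state 1 1 pass fail pass   # HEALTHY: the instance recovered
```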

As a best practice, check health and serve traffic on the same port. It is possible to perform health checks on one port and serve traffic on another, but if you do, make sure that your firewall rules and the services running on your instances are configured appropriately for both ports. If you use the same port for both but later decide to switch ports, update both the backend service and the health check.

To view the results of the latest health check with the gcloud command-line tool, use the backend-services get-health command.

gcloud compute backend-services get-health [BACKEND_SERVICE]

The command returns a healthState value for all instances in the specified backend service, with a value of either HEALTHY or UNHEALTHY:

- healthState: UNHEALTHY
  instance: us-central1-b/instances/www-video1
- healthState: HEALTHY
  instance: us-central1-b/instances/www-video2
kind: compute#backendServiceGroupHealth
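
To pick out failing instances in a script, you could filter this output. The awk-based function below is a hypothetical sketch that assumes each healthState line is followed by its instance line, as in the example above.

```shell
# Hypothetical filter: print instances whose healthState is UNHEALTHY,
# reading get-health output shaped like the example above from stdin.
list_unhealthy() {
  awk '
    /healthState:/ { state = $NF; next }   # remember the latest state
    /instance:/    { if (state == "UNHEALTHY") print $NF; state = "" }
  '
}
```

Usage: `gcloud compute backend-services get-health [BACKEND_SERVICE] | list_unhealthy`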

Backend services that do not have a valid global forwarding rule referencing them are not health checked and therefore have no health status.
