Container-native load balancing through Ingress

Autopilot Standard

This page explains how to use container-native load balancing in Google Kubernetes Engine (GKE). Container-native load balancing allows load balancers to target Kubernetes Pods directly and to evenly distribute traffic to Pods.

For more information on the benefits, requirements, and limitations of container-native load balancing, see Container-native load balancing.

Before you begin

Before you start, make sure that you have performed the following tasks:

Enable the Google Kubernetes Engine API.


If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you use primarily zonal clusters, set the compute/zone instead. By setting a default location, you can avoid errors in the gcloud CLI like the following: One of [--zone, --region] must be supplied: Please specify location. You might need to specify the location in certain commands if the location of your cluster differs from the default that you set.

Use container-native load balancing

The following sections walk you through a container-native load balancing configuration on GKE.

Create a VPC-native cluster

To use container-native load balancing, your GKE cluster must have alias IPs enabled.

For example, the following command creates a GKE cluster, neg-demo-cluster, with an auto-provisioned subnetwork:

For Autopilot mode, alias IP addresses are enabled by default:
```
gcloud container clusters create-auto neg-demo-cluster \
    --location=COMPUTE_LOCATION
```
Replace COMPUTE_LOCATION with the Compute Engine location for the cluster.

For Standard mode, enable alias IP addresses when you create the cluster:

gcloud container clusters create neg-demo-cluster \
    --enable-ip-alias \
    --create-subnetwork="" \
    --network=default \
    --location=us-central1-a
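
To confirm that a cluster created this way has alias IP addresses enabled, you could inspect its IP allocation policy. The following is a minimal sketch, not an official procedure: is_vpc_native is a hypothetical helper name, and the sketch assumes that the describe output exposes the setting as ipAllocationPolicy.useIpAliases.

```shell
# Hypothetical helper: succeeds when the named cluster reports alias IPs enabled.
# Usage: is_vpc_native CLUSTER_NAME COMPUTE_LOCATION
is_vpc_native() {
  local cluster="$1" location="$2" enabled
  enabled="$(gcloud container clusters describe "$cluster" \
    --location="$location" \
    --format='value(ipAllocationPolicy.useIpAliases)')" || return 2
  # Compare case-insensitively, because the exact casing of the printed
  # boolean is an assumption of this sketch.
  [ "$(printf '%s' "$enabled" | tr '[:upper:]' '[:lower:]')" = "true" ]
}
```

For example, running is_vpc_native neg-demo-cluster us-central1-a && echo "VPC-native" prints the message only when the check passes.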

Create a Deployment

The following sample Deployment, neg-demo-app, runs a single instance of a containerized HTTP server. We recommend you use workloads that use Pod readiness feedback.

Using Pod readiness feedback

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: neg-demo-app # Label for the Deployment
  name: neg-demo-app # Name of Deployment
spec:
  selector:
    matchLabels:
      run: neg-demo-app
  template: # Pod template
    metadata:
      labels:
        run: neg-demo-app # Labels Pods from this Deployment
    spec: # Pod specification; each Pod created by this Deployment has this specification
      containers:
      - image: registry.k8s.io/serve_hostname:v1.4 # Application to run in Deployment's Pods
        name: hostname # Container name
        ports:
        - containerPort: 9376
          protocol: TCP

Using hardcoded delay

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: neg-demo-app # Label for the Deployment
  name: neg-demo-app # Name of Deployment
spec:
  minReadySeconds: 60 # Number of seconds to wait after a Pod is created and its status is Ready
  selector:
    matchLabels:
      run: neg-demo-app
  template: # Pod template
    metadata:
      labels:
        run: neg-demo-app # Labels Pods from this Deployment
    spec: # Pod specification; each Pod created by this Deployment has this specification
      containers:
      - image: registry.k8s.io/serve_hostname:v1.4 # Application to run in Deployment's Pods
        name: hostname # Container name
        ports:
        - containerPort: 9376
          protocol: TCP
      # Note: The following line is necessary only on clusters running GKE v1.11 and lower.
      # For details, see https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing#align_rollouts
      terminationGracePeriodSeconds: 60 # Number of seconds to wait for connections to terminate before shutting down Pods

In this Deployment, each container runs an HTTP server. The HTTP server returns the hostname of the application server (the name of the Pod on which the server runs) as a response.

Save this manifest as neg-demo-app.yaml, then create the Deployment:

kubectl apply -f neg-demo-app.yaml

Create a Service for a container-native load balancer

After you have created a Deployment, you need to group its Pods into a Service.

The following sample Service, neg-demo-svc, targets the sample Deployment that you created in the previous section:

apiVersion: v1
kind: Service
metadata:
  name: neg-demo-svc # Name of Service
spec: # Service's specification
  type: ClusterIP
  selector:
    run: neg-demo-app # Selects Pods labelled run: neg-demo-app
  ports:
  - name: http
    port: 80 # Service's port
    protocol: TCP
    targetPort: 9376

The load balancer is not created until you create an Ingress for the Service.

Save this manifest as neg-demo-svc.yaml, then create the Service:

kubectl apply -f neg-demo-svc.yaml

Create an Ingress for the Service

The following sample Ingress, neg-demo-ing, targets the Service that you created:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: neg-demo-ing
spec:
  defaultBackend:
    service:
      name: neg-demo-svc # Name of the Service targeted by the Ingress
      port:
        number: 80 # Should match the port used by the Service

Save this manifest as neg-demo-ing.yaml, then create the Ingress:

kubectl apply -f neg-demo-ing.yaml

When you create the Ingress, an Application Load Balancer is created in the project, and network endpoint groups (NEGs) are created in each zone in which the cluster runs. The endpoints in the NEGs and the endpoints of the Service are kept in sync.

Verify the Ingress

After you have deployed a workload, grouped its Pods into a Service, and created an Ingress for the Service, you should verify that the Ingress has provisioned the container-native load balancer successfully.

Retrieve the status of the Ingress:

kubectl describe ingress neg-demo-ing

The output includes ADD and CREATE events:

Events:
Type     Reason   Age                From                     Message
----     ------   ----               ----                     -------
Normal   ADD      16m                loadbalancer-controller  default/neg-demo-ing
Normal   Service  4s                 loadbalancer-controller  default backend set to neg-demo-svc:32524
Normal   CREATE   2s                 loadbalancer-controller  ip: 192.0.2.0

Test the load balancer

The following sections explain how you can test the functionality of a container-native load balancer.

Visit Ingress IP address

Wait several minutes for the Application Load Balancer to be configured.

You can verify that the container-native load balancer is functioning by visiting the Ingress' IP address.

To get the Ingress IP address, run the following command:

kubectl get ingress neg-demo-ing

In the command output, the Ingress' IP address is displayed in the ADDRESS column. Visit the IP address in a web browser.
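
Because the address can take several minutes to appear, you might prefer to poll for it rather than rerun the command by hand. The following sketch assumes that the address is published in the Ingress status field status.loadBalancer.ingress[0].ip. The helper name wait_for_ingress_ip and its retry defaults (30 attempts, 10 seconds apart) are illustrative choices, not part of GKE.

```shell
# Hypothetical helper: poll until the Ingress reports an IP address, then print it.
# Usage: wait_for_ingress_ip INGRESS_NAME [MAX_TRIES] [SLEEP_SECONDS]
wait_for_ingress_ip() {
  local name="$1" tries="${2:-30}" pause="${3:-10}" ip="" i=0
  while [ "$i" -lt "$tries" ]; do
    # Read the first address in the Ingress status; treat any error as "not yet".
    ip="$(kubectl get ingress "$name" \
      -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null)" || ip=""
    if [ -n "$ip" ]; then
      printf '%s\n' "$ip"
      return 0
    fi
    sleep "$pause"
    i=$((i + 1))
  done
  echo "Ingress $name has no IP address after $tries attempts" >&2
  return 1
}
```

For example, IP_ADDRESS="$(wait_for_ingress_ip neg-demo-ing)" stores the address once it is available. An assigned address does not guarantee that the load balancer is already serving traffic.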

Check backend service health status

You can also get the health status of the load balancer's backend service.

Get a list of the backend services running in your project:
```
gcloud compute backend-services list
```
Record the name of the backend service that includes the name of the Service, such as neg-demo-svc.
Get the health status of the backend service:
```
gcloud compute backend-services get-health BACKEND_SERVICE_NAME --global
```
Replace BACKEND_SERVICE_NAME with the name of the backend service.
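
When a backend service has many endpoints, a per-state count can be easier to read than the full health report. The following sketch assumes that the default get-health output contains one healthState: STATE line per endpoint. The helper name summarize_health is hypothetical.

```shell
# Hypothetical helper: read get-health output on stdin and print one
# "STATE COUNT" line per health state found, sorted by state name.
summarize_health() {
  awk '
    # Match both "- healthState: X" and "healthState: X" line shapes.
    $1 == "healthState:" || ($1 == "-" && $2 == "healthState:") { count[$NF]++ }
    END { for (state in count) print state, count[state] }
  ' | sort
}

# Usage (BACKEND_SERVICE_NAME is the placeholder used in the previous step):
#   gcloud compute backend-services get-health BACKEND_SERVICE_NAME --global \
#     | summarize_health
```

If the output shape differs in your gcloud CLI version, adjust the match on healthState accordingly.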

Test the Ingress

Another way you can test that the load balancer functions as expected is by scaling the sample Deployment, sending test requests to the Ingress, and verifying that the correct number of replicas respond.

Scale the neg-demo-app Deployment from one instance to two instances:
```
kubectl scale deployment neg-demo-app --replicas 2
```
This command might take several minutes to complete.

Verify that the rollout is complete:

kubectl get deployment neg-demo-app

The output should include two available replicas:

NAME           DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
neg-demo-app   2         2         2            2           26m

Get the Ingress IP address:
```
kubectl describe ingress neg-demo-ing
```
If requests to the IP address return a 404 error, wait a few minutes for the load balancer to start, then try again.
Count the number of distinct responses from the load balancer:
```
for i in `seq 1 100`; do \
  curl --connect-timeout 1 -s IP_ADDRESS && echo; \
done  | sort | uniq -c
```
Replace IP_ADDRESS with the Ingress IP address.

The output is similar to the following:
```
44 neg-demo-app-7f7dfd7bc6-dcn95
56 neg-demo-app-7f7dfd7bc6-jrmzf
```
In this output, the number of distinct responses is the same as the number of replicas, which indicates that all backend Pods are serving traffic.
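
To automate the comparison, you could count the distinct responders instead of reading the output by eye. This is a minimal sketch: count_distinct_responders is a hypothetical helper name, and it only counts unique non-empty lines, so it assumes that each response is a single line containing the Pod name.

```shell
# Hypothetical helper: count distinct non-empty lines read from stdin.
count_distinct_responders() {
  awk 'NF && !seen[$0]++ { n++ } END { print n + 0 }'
}

# Usage (IP_ADDRESS is the Ingress IP address, as in the previous step):
#   for i in $(seq 1 100); do
#     curl --connect-timeout 1 -s IP_ADDRESS && echo
#   done | count_distinct_responders
```

If the printed number is lower than the number of available replicas, some Pods did not answer any of the test requests.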

Clean up

After completing the tasks on this page, remove the resources as follows to avoid incurring unwanted charges on your account:

Delete the cluster

gcloud

gcloud container clusters delete neg-demo-cluster

Console

Go to the Google Kubernetes Engine page in the Google Cloud console.

Select neg-demo-cluster and click Delete.
When prompted to confirm, click Delete.

Troubleshooting

Use the techniques below to verify your networking configuration. The following sections explain how to resolve specific issues related to container-native load balancing.

See the load balancing documentation for how to list your network endpoint groups.
You can find the name and zones of the NEG that corresponds to a Service in the neg-status annotation of the Service. Get the Service specification with the following command:
```
kubectl get svc SERVICE_NAME -o yaml
```
The metadata:annotations:cloud.google.com/neg-status annotation lists the name of the Service's corresponding NEG and the zones of the NEG.
You can check the health of the backend service that corresponds to a NEG with the following command:
```
gcloud compute backend-services --project PROJECT_NAME \
    get-health BACKEND_SERVICE_NAME --global
```
The backend service has the same name as its NEG.
To print a service's event logs:
```
kubectl describe svc SERVICE_NAME
```
The name string of the backend service includes the name and namespace of the corresponding GKE Service.
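
To read only the neg-status annotation instead of the whole Service specification, you could query it with a JSONPath expression. This is a sketch under stated assumptions: neg_status is a hypothetical helper name, and the namespace defaults to default when you omit it.

```shell
# Hypothetical helper: print the raw neg-status annotation of a Service.
# Usage: neg_status SERVICE_NAME [NAMESPACE]
neg_status() {
  local service="$1" namespace="${2:-default}"
  # Dots inside the annotation key are escaped so that JSONPath treats
  # "cloud.google.com/neg-status" as a single key.
  kubectl get svc "$service" --namespace "$namespace" \
    -o jsonpath='{.metadata.annotations.cloud\.google\.com/neg-status}'
}
```

An empty result suggests that the annotation has not been written yet, for example because no Ingress references the Service.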

Cannot create a cluster with alias IPs

Symptoms

When you attempt to create a cluster with alias IPs, you might encounter the following error:

ResponseError: code=400, message=IP aliases cannot be used with a legacy network.

Potential causes

You encounter this error if you attempt to create a cluster with alias IPs that also uses a legacy network.

Resolution

Ensure that you don't create a cluster with alias IPs and a legacy network enabled simultaneously. For more information about using alias IPs, refer to Create a VPC-native cluster.

Traffic does not reach endpoints

Symptoms

502/503 errors or rejected connections.

Potential causes

New endpoints generally become reachable after attaching them to the load balancer, provided that they respond to health checks. You might encounter 502 errors or rejected connections if traffic cannot reach the endpoints.

502 errors and rejected connections can also be caused by a container that doesn't handle SIGTERM. If a container doesn't explicitly handle SIGTERM, it immediately terminates and stops handling requests. The load balancer continues to send incoming traffic to the terminated container, leading to errors.

The container-native load balancer has only one backend endpoint. During a rolling update, the old endpoint is deprogrammed before the new endpoint is programmed.

Backend Pods are deployed into a new zone for the first time after a container-native load balancer is provisioned. Load balancer infrastructure is programmed in a zone only when the zone has at least one endpoint. When the first endpoint is added to a new zone, the load balancer infrastructure for that zone is programmed, which can cause service disruptions.

Resolution

Configure containers to handle SIGTERM and continue responding to requests throughout the termination grace period (30 seconds by default). Configure Pods to begin failing health checks when they receive SIGTERM. This signals the load balancer to stop sending traffic to the Pod while endpoint deprogramming is in progress.

If your application does not shut down gracefully and stops responding to requests when it receives SIGTERM, you can use a preStop hook to delay SIGTERM and keep serving traffic while endpoint deprogramming is in progress.

lifecycle:
  preStop:
    exec:
      # if SIGTERM triggers a quick exit; keep serving traffic instead
      command: ["sleep","60"]

See the documentation on Pod termination.

If your load balancer backend has only one instance, configure the rollout strategy so that the only instance isn't torn down before the new instance is fully programmed. For application Pods managed by a Deployment, you can achieve this by setting the maxUnavailable parameter of the rollout strategy to 0.

strategy:
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0

To troubleshoot traffic not reaching the endpoints, verify that firewall rules allow incoming TCP traffic to your endpoints from the 130.211.0.0/22 and 35.191.0.0/16 ranges. To learn more, refer to Adding Health Checks in the Cloud Load Balancing documentation.

View the backend services in your project. The name string of the relevant backend service includes the name and namespace of the corresponding GKE Service:

gcloud compute backend-services list

Retrieve the backend health status from the backend service:

gcloud compute backend-services get-health BACKEND_SERVICE_NAME --global

If all backends are unhealthy, your firewall, Ingress, or Service might be misconfigured.

If some backends are unhealthy for a short period of time, network programming latency might be the cause.

If some backends do not appear in the list of backend services, programming latency might be the cause. You can verify this by running the following command, where NEG_NAME is the name of the backend service (NEGs and backend services share the same name):

gcloud compute network-endpoint-groups list-network-endpoints NEG_NAME

Check if all the expected endpoints are in the NEG.

If a container-native load balancer selects only a small number of backends (for example, one Pod), consider increasing the number of replicas and distributing the backend Pods across all zones that the GKE cluster spans. Doing so ensures that the underlying load balancer infrastructure is fully programmed. Otherwise, consider restricting the backend Pods to a single zone.

If you configure a network policy for the endpoint, make sure that ingress from the proxy-only subnet is allowed.

Stalled rollout

Symptoms

Rolling out an updated Deployment stalls, and the number of up-to-date replicas does not match the desired number of replicas.

Potential causes

The Deployment's health checks are failing. The container image might be bad or the health check might be misconfigured. The rolling replacement of Pods waits until the newly started Pod passes its Pod readiness gate, which happens only if the Pod responds to load balancer health checks. If the Pod does not respond, or if the health check is misconfigured, the readiness gate conditions can't be met and the rollout can't continue.

If you're using kubectl 1.13 or higher, you can check the status of a Pod's readiness gates with the following command:

kubectl get pod POD_NAME -o wide

Check the READINESS GATES column.

This column doesn't exist in kubectl 1.12 and lower. A Pod that is marked as being in the READY state may have a failed readiness gate. To verify this, use the following command:

kubectl get pod POD_NAME -o yaml

The readiness gates and their status are listed in the output.
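
Instead of scanning the full YAML output, you could extract only the condition that the NEG readiness gate sets. The following sketch assumes that the condition type is cloud.google.com/load-balancer-neg-ready, the name that appears in the event message shown under Known issues. The helper name neg_ready_status is hypothetical.

```shell
# Hypothetical helper: print the status of the NEG readiness gate condition
# on a Pod (for example True or False), or nothing if the condition is absent.
# Usage: neg_ready_status POD_NAME
neg_ready_status() {
  local pod="$1"
  # The JSONPath filter selects the condition by its type, then prints its status.
  kubectl get pod "$pod" \
    -o jsonpath='{.status.conditions[?(@.type=="cloud.google.com/load-balancer-neg-ready")].status}'
}
```

A Pod whose gate stays at False while its containers are ready is a likely sign that the load balancer health check is not passing.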

Resolution

Verify that the container image in your Deployment's Pod specification is functioning correctly and is able to respond to health checks. Verify that the health checks are correctly configured.

Degraded mode errors

Symptoms

Starting from GKE version 1.29.2-gke.1643000, you might get the following warnings on your service in the Logs Explorer when NEGs are updated:

Entering degraded mode for NEG <service-namespace>/<service-name>-<neg-name>... due to sync err: endpoint has missing nodeName field

Potential causes

These warnings indicate GKE has detected endpoint misconfigurations during an NEG update based on EndpointSlice objects, triggering a more in-depth calculation process called degraded mode. GKE continues to update NEGs on a best-effort basis, by either correcting the misconfiguration or excluding the invalid endpoints from the NEG updates.

Some common errors are:

endpoint has missing pod/nodeName field
endpoint corresponds to an non-existing pod/node
endpoint information for attach/detach operation is incorrect

Resolution

Typically, transitory states cause these events and they are fixed on their own. However, events caused by misconfigurations in custom EndpointSlice objects remain unresolved. To understand the misconfiguration, examine the EndpointSlice objects corresponding to the service:

kubectl get endpointslice -l kubernetes.io/service-name=<service-name>

Validate each endpoint based on the error in the event.
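
For the missing nodeName error in particular, a small filter can point out the affected endpoints. This is a sketch, not an official diagnostic: endpoints_missing_node is a hypothetical helper name, and it assumes that you feed it one ADDRESS, tab, NODE_NAME line per endpoint, which the JSONPath template in the usage comment is intended to produce.

```shell
# Hypothetical helper: read "ADDRESS<TAB>NODE_NAME" lines on stdin and print
# the addresses whose node name is empty.
endpoints_missing_node() {
  awk -F '\t' '$1 != "" && $2 == "" { print $1 }'
}

# Usage (SERVICE_NAME is a placeholder for your Service's name):
#   kubectl get endpointslice -l kubernetes.io/service-name=SERVICE_NAME \
#     -o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{"\t"}{.nodeName}{"\n"}{end}' \
#     | endpoints_missing_node
```

Each printed address belongs to an endpoint that has no nodeName, which you can then look up in the EndpointSlice objects.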

To resolve the issue, you must manually modify the EndpointSlice objects. The update triggers NEGs to update again. Once the misconfiguration no longer exists, the output is similar to the following:

NEG <service-namespace>/<service-name>-<neg-name>... is no longer in degraded mode

Known issues

Container-native load balancing on GKE has the following known issues:

NEG readiness gate race condition

Under certain conditions, the readiness gates might report a false-positive ready state before the Ingress health check reports a healthy state. This generates error events like the following on the Ingress object:

NEG is not attached to any BackendService with health checking. Marking condition "cloud.google.com/load-balancer-neg-ready" to True.

Symptoms

This issue causes the load balancer to return 503 errors (failed_to_pick_backend) while GKE performs a rolling update of the Deployment.

Causes

The GKE NEG Controller relies on the Compute Engine NEG health check information to report whether an endpoint is healthy. Consider the following sequence of events:

A new Pod from a rolling update gets created.
GKE NEG Controller adds this new Pod IP address to the Network Endpoint Group.
GKE NEG Controller requests the health status of the newly added endpoint.
Compute Engine NEG Service does not have health information yet and returns an empty status.
GKE NEG Controller assumes empty status means no health check is configured and marks the Pod as Ready.
GKE removes the old Pod thinking the new one is ready to serve traffic.
If the new Pod is the only backend left for the load balancer, the load balancer returns a 503 Service Unavailable Response.
Once the Pod starts to pass its health check, the load balancer will start to return 200 OK Responses as expected.

The GKE NEG Controller cannot differentiate between two different health check states: missing-health-check-because-not-attached-to-backend, and missing-health-check-because-health-check-is-not-yet-programmed.

Because the GKE NEG Controller can't make this distinction, it must assume that a NEG whose endpoints all lack health check information doesn't belong to any BackendService.

While this scenario is unlikely, having a relatively small number of Pods (e.g. 2) compared to the number of NEGs will increase the risk of this race condition. Remember, NEGs are created for each region, with one NEG per zone, which typically results in three NEGs.

The corollary is that if there is a relatively larger number of Pods, such that each NEG always has multiple Pods before the rolling update starts, this race condition is very unlikely to be triggered.

Resolutions

The best way to prevent this race condition (and ultimately to prevent 503 Service Unavailable responses during rolling updates) is to have more backends in the Network Endpoint Group.

Configure the rolling update strategy so that at least one Pod is always running.

For example, if only two Pods normally run, the configuration could look like this:

strategy:
   type: RollingUpdate
   rollingUpdate:
     maxUnavailable: 0
     maxSurge: 1

The previous example is a suggestion. You must update the strategy based on several factors, such as the number of replicas.

Incomplete garbage collection

GKE garbage collects container-native load balancers every two minutes. If a cluster is deleted before load balancers are fully removed, you need to manually delete the load balancer's NEGs.

View the NEGs in your project by running the following command:

gcloud compute network-endpoint-groups list

In the command output, look for the relevant NEGs.

To delete a NEG, run the following command, replacing NEG_NAME with the name of the NEG:

gcloud compute network-endpoint-groups delete NEG_NAME

Align workload rollouts with endpoint propagation

When you deploy a workload to your cluster, or when you update an existing workload, the container-native load balancer can take longer to propagate new endpoints than it takes to finish the workload rollout. The sample Deployment that you deploy in this guide uses two fields to align its rollout with the propagation of endpoints: terminationGracePeriodSeconds and minReadySeconds.

terminationGracePeriodSeconds allows the Pod to shut down gracefully by waiting for connections to terminate after a Pod is scheduled for deletion.

minReadySeconds adds a latency period after a Pod is created. You specify a minimum number of seconds for which a new Pod should be in Ready status, without any of its containers crashing, for the Pod to be considered available.

You should configure your workloads' minReadySeconds and terminationGracePeriodSeconds values to be 60 seconds or higher to ensure that the service is not disrupted due to workload rollouts.

terminationGracePeriodSeconds is available in all Pod specifications, and minReadySeconds is available for Deployments and DaemonSets.

To learn more about fine-tuning rollouts, refer to RollingUpdateStrategy.
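
To check whether an existing Deployment already meets the 60-second recommendation, you could read both fields and compare them. The following is a minimal sketch: check_rollout_timing is a hypothetical helper name, and it treats an unset field as 0.

```shell
# Hypothetical helper: succeeds when both rollout timing fields of a
# Deployment are at least MIN_SECONDS (60 by default).
# Usage: check_rollout_timing DEPLOYMENT_NAME [MIN_SECONDS]
check_rollout_timing() {
  local deploy="$1" min="${2:-60}" values ready grace
  # Print both fields separated by a comma so that an empty field stays visible.
  values="$(kubectl get deployment "$deploy" \
    -o jsonpath='{.spec.minReadySeconds},{.spec.template.spec.terminationGracePeriodSeconds}')" || return 2
  ready="${values%%,*}"   # text before the comma: minReadySeconds
  grace="${values#*,}"    # text after the comma: terminationGracePeriodSeconds
  # An unset field prints as empty text; treat it as 0 for the comparison.
  [ "${ready:-0}" -ge "$min" ] && [ "${grace:-0}" -ge "$min" ]
}
```

For example, check_rollout_timing neg-demo-app || echo "rollout timing below 60 seconds" prints the warning when either field is too low.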

initialDelaySeconds in Pod readinessProbe not respected

You might expect the initialDelaySeconds configuration in the Pod's readinessProbe to be respected by the container-native load balancer; however, readinessProbe is implemented by kubelet, and the initialDelaySeconds configuration controls the kubelet health check, not the container-native load balancer. Container-native load balancing has its own load balancing health check.

What's next

Learn more about NEGs.
Learn more about VPC-native clusters.
