Troubleshoot load balancing in GKE


This page shows you how to resolve issues related to load balancing in Google Kubernetes Engine (GKE) clusters using Service, Ingress, or Gateway resources.

If you need additional assistance, reach out to Cloud Customer Care.

BackendConfig not found

This error occurs when a BackendConfig for a Service port is specified in the Service annotation, but the actual BackendConfig resource couldn't be found.

To evaluate a Kubernetes event, run the following command:

kubectl get event

The following example output indicates your BackendConfig was not found:

KIND    ... SOURCE
Ingress ... loadbalancer-controller

MESSAGE
Error during sync: error getting BackendConfig for port 80 on service "default/my-service":
no BackendConfig for service port exists

To resolve this issue, ensure you have not created the BackendConfig resource in the wrong namespace or misspelled its reference in the Service annotation.

Ingress security policy not found

After the Ingress object is created, if the security policy isn't properly associated with the LoadBalancer Service, evaluate the Kubernetes event to see if there is a configuration mistake. If your BackendConfig specifies a security policy that does not exist, a warning event is periodically emitted.

To evaluate a Kubernetes event, run the following command:

kubectl get event

The following example output indicates your security policy was not found:

KIND    ... SOURCE
Ingress ... loadbalancer-controller

MESSAGE
Error during sync: The given security policy "my-policy" does not exist.

To resolve this issue, specify the correct security policy name in your BackendConfig.

Addressing 500 series errors with NEGs during workload scaling in GKE

Symptom:

When you use GKE provisioned NEGs for load balancing, you might experience 502 or 503 errors for the services during the workload scale down. 502 errors occur when Pods are terminated before existing connections close, while the 503 errors occur when traffic is directed to deleted Pods.

This issue can affect clusters if you are using GKE managed load balancing products that use NEGs, including Gateway, Ingress, and standalone NEGs. If you frequently scale your workloads, your cluster is at a higher risk of being affected.

Diagnosis:

Removing a Pod in Kubernetes without draining its endpoint and removing it from the NEG first leads to 500 series errors. To avoid issues during Pod termination, you must consider the order of operations. The following images display scenarios when BackendService Drain Timeout is unset and BackendService Drain Timeout is set with BackendConfig.

Scenario 1: BackendService Drain Timeout is unset.

The following image displays a scenario where the BackendService Drain Timeout is unset.

BackendService Drain Timeout is unset.

Scenario 2: BackendService Drain Timeout is set.

The following image displays a scenario where the BackendService Drain Timeout is set.

BackendService Drain Timeout is set.

The exact time the 500 series errors occur depends on the following factors:

  • NEG API detach latency: The NEG API detach latency represents the current time taken for the detach operation to finalize in Google Cloud. This is influenced by a variety of factors outside Kubernetes, including the type of load balancer and the specific zone.

  • Drain latency: Drain latency represents the time taken for the load balancer to start directing traffic away from a particular part of your system. Once drain is initiated, the load balancer stops sending new requests to the endpoint, however there is still a latency in triggering drain (Drain Latency) which can cause temporary 503 errors if the Pod no longer exists.

  • Health check configuration: More sensitive health check thresholds mitigate the duration of 503 errors as it can signal the load balancer to stop sending requests to endpoints even if the detach operation has not finished.

  • Termination grace period: The termination grace period determines the maximum amount of time a Pod is given to exit. However, a Pod can exit before the termination grace period completes. If a Pod takes longer than this period, the Pod is forced to exit at the end of this period. This is a setting on the Pod and needs to be configured in the workload definition.

Potential resolution:

To prevent those 5XX errors, apply the following settings. The timeout values are suggestive and you might need to adjust them for your specific application. The following section guides you through the customization process.

The following image displays how to keep the Pod alive with preStopHook:

BackendService Drain Timeout is set.

To avoid 500 series errors, perform the following steps:

  1. Set the BackendService Drain Timeout for your service to 1 minute.

  2. Extend the terminationGracePeriod on the Pod.

    Set the terminationGracePeriodSeconds on the Pod to 3.5 minutes. When combined with the recommended settings, this allows your Pods a 30 to 45 second window for a graceful shutdown after the Pod's endpoint has been removed from the NEG. If you require more time for the graceful shutdown, you can extend the grace period or follow the instructions mentioned in the Customize timeouts section.

    The following Pod manifest specifies a connection draining timeout of 210 seconds (3.5 minutes):

    spec:
      terminationGracePeriodSeconds: 210
      containers:
      - name: my-app
        ...
      ...
    
  3. Apply a preStopHook to all containers.

    Apply a preStopHook that will ensure the Pod is alive for 120 seconds longer while the Pod's endpoint is drained in the load balancer and the endpoint is removed from the NEG.

    spec:
      containers:
      - name: my-app
        ...
        lifecycle:
          preStopHook:
            exec:
              command: ["/bin/sh", "-c", "sleep 120s"]
      ...
    

Customize timeouts

To ensure Pod continuity and prevent 500 series errors, the Pod must be alive until the endpoint is removed from the NEG. Specially to prevent 502 and 503 errors, consider implementing a combination of timeouts and a preStopHook.

To keep the Pod alive longer during the shutdown process, add a preStopHook to the Pod. Run the preStopHook before a Pod is signaled to exit, so the preStopHook can be used to keep the Pod around until its corresponding endpoint is removed from the NEG.

To extend the duration that a Pod remains active during the shutdown process, insert a preStopHook into the Pod configuration as follows:

spec:
  containers:
  - name: my-app
    ...
    lifecycle:
      preStopHook:
        exec:
          command: ["/bin/sh", "-c", "sleep <latency time>"]

You can configure timeouts and related settings to manage the graceful shutdown of Pods during workload scale downs. You can adjust timeouts based on specific use cases. We recommend that you start with longer timeouts and reduce the duration as necessary. You can customize the timeouts by configuring timeout-related parameters and the preStopHook in the following ways:

Backend Service Drain Timeout

The Backend Service Drain Timeout parameter is unset by default and has no effect. If you set the Backend Service Drain Timeout parameter and activate it, the load balancer stops routing new requests to the endpoint and waits the timeout before terminating existing connections.

You can set the Backend Service Drain Timeout parameter by using the BackendConfig with Ingress, the GCPBackendPolicy with Gateway or manually on the BackendService with standalone NEGs. The timeout should be 1.5 to 2 times longer than the time it takes to process a request. This ensures if a request came in right before the drain was initiated, it will complete before the timeout completes. Setting the Backend Service Drain Timeout parameter to a value greater than 0 helps mitigate 503 errors because no new requests are sent to endpoints scheduled for removal. For this timeout to be effective, you must use it in conjunction with the preStopHook to ensure that the Pod remains active while the drain occurs. Without this combination, existing requests that didn't complete will receive a 502 error.

preStopHook time

The preStopHook must delay Pod shut down sufficiently for both Drain latency and backend service drain timeout to complete, ensuring proper connection drainage and endpoint removal from the NEG before the Pod is shut down.

For optimal results, ensure your preStopHook execution time is less than or equal to the sum of the Backend Service Drain Timeout and Drain Latency.

This can be calculated using the formula:

preStopHook >= Backend Service Drain Timeout + Drain Latency

We recommend setting the Drain Latency to 1 minute. If 500 errors persist, estimate the total occurrence duration and add double that time to the latency. This ensures that your Pod has enough time to drain gracefully before being removed from the service. You can adjust this value if it's too long for your specific use case.

Alternatively, you can estimate the timing by examining the deletion timestamp from the Pod and the timestamp when the endpoint was removed from the NEG in the Cloud Audit Logs.

Termination Grace Period parameter

You must configure the terminationGracePeriod parameter to allow sufficient time for the preStopHook to finish and for the Pod to complete a graceful shutdown.

By default, when not explicitly set, the terminationGracePeriod is 30 seconds. You can calculate the optimal terminationGracePeriod using the formula:

terminationGracePeriod >= preStopHook Time + Pod shutdown time

To define terminationGracePeriod within the Pod's configuration as follows:

  spec:
    terminationGracePeriodSeconds: <terminationGracePeriod>
    containers:
    - name: my-app
      ...
    ...

NEG not found when creating an Internal Ingress resource

The following error might occur when you create an internal Ingress in GKE:

Error syncing: error running backend syncing routine: googleapi: Error 404: The resource 'projects/PROJECT_ID/zones/ZONE/networkEndpointGroups/NEG' was not found, notFound

This error occurs because Ingress for internal Application Load Balancers requires Network Endpoint Groups (NEGs) as backends.

In Shared VPC environments or clusters with Network Policies enabled, add the annotation cloud.google.com/neg: '{"ingress": true}' to the Service manifest.

504 Gateway Timeout: upstream request timeout

The following error might occur when you access a Service from an internal Ingress in GKE:

HTTP/1.1 504 Gateway Timeout
content-length: 24
content-type: text/plain

upsteam request timeout

This error occurs because traffic sent to internal Application Load Balancers are proxied by envoy proxies in the proxy-only subnet range.

To allow traffic from the proxy-only subnet range, create a firewall rule on the targetPort of the Service.

Error 400: Invalid value for field 'resource.target'

The following error might occur when you access a Service from an internal Ingress in GKE:

Error syncing:LB_NAME does not exist: googleapi: Error 400: Invalid value for field 'resource.target': 'https://www.googleapis.com/compute/v1/projects/PROJECT_NAME/regions/REGION_NAME/targetHttpProxies/LB_NAME. A reserved and active subnetwork is required in the same region and VPC as the forwarding rule.

To resolve this issue, create a proxy-only subnet.

Error during sync: error running load balancer syncing routine: loadbalancer does not exist

One of the following errors might occur when the GKE control plane upgrades or when you modify an Ingress object:

"Error during sync: error running load balancer syncing routine: loadbalancer
INGRESS_NAME does not exist: invalid ingress frontend configuration, please
check your usage of the 'kubernetes.io/ingress.allow-http' annotation."

Or:

Error during sync: error running load balancer syncing routine: loadbalancer LOAD_BALANCER_NAME does not exist:
googleapi: Error 400: Invalid value for field 'resource.IPAddress':'INGRESS_VIP'. Specified IP address is in-use and would result in a conflict., invalid

To resolve these issues, try the following steps:

  • Add the hosts field in the tls section of the Ingress manifest, then delete the Ingress. Wait five minutes for GKE to delete the unused Ingress resources. Then, recreate the Ingress. For more information, see The hosts field of an Ingress object.
  • Revert the changes you made to the Ingress. Then, add a certificate using an annotation or Kubernetes Secret.

External Ingress produces HTTP 502 errors

Use the following guidance to troubleshoot HTTP 502 errors with external Ingress resources:

  1. Enable logs for each backend service associated with each GKE Service that is referenced by the Ingress.
  2. Use status details to identify causes for HTTP 502 responses. Status details that indicate the HTTP 502 response originated from the backend require troubleshooting within the serving Pods, not the load balancer.

Unmanaged instance groups

You might experience HTTP 502 errors with external Ingress resources if your external Ingress uses unmanaged instance group backends. This issue occurs when all of the following conditions are met:

  • The cluster has a large total number of nodes among all node pools.
  • The serving Pods for one or more Services that are referenced by the Ingress are located on only a few nodes.
  • Services referenced by the Ingress use externalTrafficPolicy: Local.

To determine if your external Ingress uses unmanaged instance group backends, do the following:

  1. Go to the Ingress page in the Google Cloud console.

    Go to Ingress

  2. Click the name of your external Ingress.

  3. Click the name of the Load balancer. The Load balancing details page displays.

  4. Check the table in the Backend services section to determine if your external Ingress uses NEGs or instance groups.

To resolve this issue, use one of the following solutions:

  • Use a VPC-native cluster.
  • Use externalTrafficPolicy: Cluster for each Service referenced by the external Ingress. This solution causes you to lose the original client IP address in the packet's sources.
  • Use the node.kubernetes.io/exclude-from-external-load-balancers=true annotation. Add the annotation to the nodes or node pools that do not run any serving Pod for any Service referenced by any external Ingress or LoadBalancer Service in your cluster.

Use load balancer logs to troubleshoot

You can use internal passthrough Network Load Balancer logs and external passthrough Network Load Balancer logs to troubleshoot issues with load balancers and correlate traffic from load balancers to GKE resources.

Logs are aggregated per-connection and exported in near real time. Logs are generated for each GKE node involved in the data path of a LoadBalancer Service, for both ingress and egress traffic. Log entries include additional fields for GKE resources, such as: - Cluster name - Cluster location - Service name - Service namespace - Pod name - Pod namespace

Pricing

There are no additional charges for using logs. Based on how you ingest logs, standard pricing for Cloud Logging, BigQuery, or Pub/Sub apply. Enabling logs has no effect on the performance of the load balancer.

Use diagnostic tools to troubleshoot

The check-gke-ingress diagnostic tool inspects Ingress resources for common misconfigurations. You can use the check-gke-ingress tool in the following ways:

  • Run the gcpdiag command-line tool on your cluster. Ingress results appear in the check rule gke/ERR/2023_004 section.
  • Use the check-gke-ingress tool alone or as a kubectl plugin by following the instructions in check-gke-ingress.

What's next

If you need additional assistance, reach out to Cloud Customer Care.