Troubleshoot GKE authentication issues


This page shows you how to resolve issues related to security configurations in your Google Kubernetes Engine (GKE) Autopilot and Standard clusters.

If you need additional assistance, reach out to Cloud Customer Care.

RBAC and IAM

Authenticated IAM accounts fail to perform in-cluster actions

The following issue occurs when you try to perform an action in the cluster but GKE can't find an RBAC policy that authorizes the action. GKE attempts to find an IAM allow policy that grants the same permission. If that fails, you see an error message similar to the following:

```
Error from server (Forbidden): roles.rbac.authorization.k8s.io is forbidden:
User "example-account@example-project.iam.gserviceaccount.com" cannot list resource "roles" in
API group "rbac.authorization.k8s.io" in the namespace "kube-system": requires
one of ["container.roles.list"] permission(s).
```

To resolve this issue, use an RBAC policy to grant the permissions for the attempted action. For example, to resolve the issue in the previous sample, grant a Role that has the list permission on roles objects in the kube-system namespace. For instructions, see Authorize actions in clusters using role-based access control.
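
As a minimal sketch of the fix for the sample error, the following shell function creates a Role that allows listing Role objects and binds it to the IAM account. The function name `grant_roles_list` and the default Role name `roles-lister` are hypothetical and not part of GKE; the account email comes from your own error message.

```shell
# Sketch: grant the "list roles" permission from the sample error with RBAC.
# grant_roles_list and the default name "roles-lister" are illustrative only.
grant_roles_list() {
  local account="$1"                    # IAM account email from the error message
  local namespace="${2:-kube-system}"   # namespace named in the error message
  local role_name="${3:-roles-lister}"  # hypothetical Role name

  # Create a Role that allows listing Role objects in the namespace.
  kubectl create role "$role_name" \
    --verb=list \
    --resource=roles.rbac.authorization.k8s.io \
    --namespace="$namespace" || return 1

  # Bind the Role to the IAM account, which RBAC treats as a User subject.
  kubectl create rolebinding "$role_name-binding" \
    --role="$role_name" \
    --user="$account" \
    --namespace="$namespace"
}
```

For example, `grant_roles_list example-account@example-project.iam.gserviceaccount.com` targets the kube-system namespace by default. Running it requires an identity that is already allowed to create Roles and RoleBindings in that namespace.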

Workload Identity Federation for GKE

Pod can't authenticate to Google Cloud

If your application can't authenticate to Google Cloud, make sure that the following settings are configured properly:

Check that you have enabled the IAM Service Account Credentials API in the project containing the GKE cluster.

Confirm that Workload Identity Federation for GKE is enabled on the cluster by verifying that it has a workload identity pool set:
```
gcloud container clusters describe CLUSTER_NAME \
    --format="value(workloadIdentityConfig.workloadPool)"
```
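Replace CLUSTER_NAME with the name of your GKE cluster. If Workload Identity Federation for GKE is enabled, the output is the workload pool name, in the form PROJECT_ID.svc.id.goog. An empty output means that the feature isn't enabled on the cluster.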

If you haven't already specified a default zone or region for gcloud, you might also need to specify a --region or --zone flag when running this command.
Make sure that the GKE metadata server is configured on the node pool where your application is running:
```
gcloud container node-pools describe NODEPOOL_NAME \
    --cluster=CLUSTER_NAME \
    --format="value(config.workloadMetadataConfig.mode)"
```
Replace the following:
- NODEPOOL_NAME with the name of your node pool.
- CLUSTER_NAME with the name of your GKE cluster.

If the GKE metadata server is enabled on the node pool, the output is GKE_METADATA.
Verify that the Kubernetes service account is annotated correctly:
```
kubectl describe serviceaccount \
    --namespace NAMESPACE KSA_NAME
```
Replace the following:
- NAMESPACE with the namespace of the Kubernetes service account.
- KSA_NAME with the name of your Kubernetes service account.
The expected output contains an annotation similar to the following:
```
iam.gke.io/gcp-service-account: GSA_NAME@PROJECT_ID.iam.gserviceaccount.com
```

Check that the IAM service account is configured correctly:
```
gcloud iam service-accounts get-iam-policy \
    GSA_NAME@GSA_PROJECT.iam.gserviceaccount.com
```
The expected output contains a binding similar to the following:
```
- members:
  - serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]
  role: roles/iam.workloadIdentityUser
```

If you have a cluster network policy, you must allow egress to 127.0.0.1/32 on port 988 for clusters running GKE versions prior to 1.21.0-gke.1000, or to 169.254.169.252/32 on port 988 for clusters running GKE version 1.21.0-gke.1000 and later. For clusters running GKE Dataplane V2, you must allow egress to 169.254.169.254/32 on port 80. To review the egress rules in your policy, run the following command:
```
kubectl describe networkpolicy NETWORK_POLICY_NAME
```
Replace NETWORK_POLICY_NAME with the name of your GKE network policy.
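
If the annotation or the IAM binding from the preceding checks is missing, the following sketch shows one way to re-create both. The helper names `wi_member` and `link_ksa_to_gsa` are hypothetical; the member string format, the role, and the annotation key are the ones shown in the expected outputs above.

```shell
# Print the IAM member string that identifies a Kubernetes service account.
wi_member() {
  local project_id="$1" namespace="$2" ksa_name="$3"
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' \
    "$project_id" "$namespace" "$ksa_name"
}

# Re-create the IAM binding and the annotation that link a Kubernetes
# service account to an IAM service account (hypothetical helper).
link_ksa_to_gsa() {
  local project_id="$1" namespace="$2" ksa_name="$3" gsa_email="$4"

  # Allow the Kubernetes service account to impersonate the IAM service account.
  gcloud iam service-accounts add-iam-policy-binding "$gsa_email" \
    --role=roles/iam.workloadIdentityUser \
    --member="$(wi_member "$project_id" "$namespace" "$ksa_name")" || return 1

  # Annotate the Kubernetes service account with the IAM service account email.
  kubectl annotate serviceaccount "$ksa_name" \
    --namespace="$namespace" --overwrite \
    "iam.gke.io/gcp-service-account=$gsa_email"
}
```

For example, `link_ksa_to_gsa PROJECT_ID NAMESPACE KSA_NAME GSA_NAME@GSA_PROJECT.iam.gserviceaccount.com`, where PROJECT_ID is the project that contains the cluster.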

DNS resolution issues

Some Google Cloud client libraries are configured to connect to the GKE and Compute Engine metadata servers by resolving the DNS name metadata.google.internal; for these libraries, healthy in-cluster DNS resolution is a critical dependency for your workloads to authenticate to Google Cloud services.

How you detect this problem depends on the details of your deployed application, including its logging configuration. Look for error messages that do either of the following:

- Tell you to configure GOOGLE_APPLICATION_CREDENTIALS.
- Tell you that your requests to a Google Cloud service were rejected because the request had no credentials.

If you encounter problems with DNS resolution of metadata.google.internal, some Google Cloud client libraries can be instructed to skip DNS resolution by setting the environment variable GCE_METADATA_HOST to 169.254.169.254:

```
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  namespace: default
spec:
  containers:
  - image: debian
    name: main
    command: ["sleep", "infinity"]
    env:
    - name: GCE_METADATA_HOST
      value: "169.254.169.254"
```

This is the hardcoded IP address at which the metadata service is always available on Google Cloud compute platforms.

Supported Google Cloud libraries:

- Python
- Java
- Node.js
- Go (note, however, that the Go client library already prefers to connect by IP address rather than by DNS name)
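
To confirm whether DNS resolution is the cause, you can check from inside the affected container whether metadata.google.internal resolves. The following is a minimal sketch: the function name `check_metadata_dns` is hypothetical, and it assumes that the `getent` utility is available in the container image, which is typical for glibc-based images such as debian but not guaranteed elsewhere.

```shell
# Run inside the affected container. Assumes getent is installed in the image.
check_metadata_dns() {
  local host="${1:-metadata.google.internal}"
  if getent hosts "$host" > /dev/null; then
    echo "DNS OK: $host resolves"
  else
    echo "DNS FAILED: $host does not resolve" >&2
    echo "Consider setting GCE_METADATA_HOST=169.254.169.254" >&2
    return 1
  fi
}
```

One way to run it is to open a shell in the Pod with `kubectl exec`, paste the function, and call `check_metadata_dns`.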

Timeout errors at Pod start up

The GKE metadata server needs a few seconds before it can start accepting requests on a new Pod. Attempts to authenticate using Workload Identity Federation for GKE within the first few seconds of a Pod's life might fail for applications and Google Cloud client libraries configured with a short timeout.

If you encounter timeout errors, try the following:

- Update the Google Cloud client libraries that your workloads use.
- Change the application code to wait a few seconds and retry.
- Deploy an initContainer that waits until the GKE metadata server is ready before running the Pod's main container.

For example, the following manifest is for a Pod with an initContainer:

```
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-initcontainer
spec:
  serviceAccountName: KSA_NAME
  initContainers:
  - image: gcr.io/google.com/cloudsdktool/cloud-sdk:alpine
    name: workload-identity-initcontainer
    command:
    - '/bin/bash'
    - '-c'
    - |
      curl -sS -H 'Metadata-Flavor: Google' 'http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token' --retry 30 --retry-connrefused --retry-max-time 60 --connect-timeout 3 --fail --retry-all-errors > /dev/null && exit 0 || echo 'Retry limit exceeded. Failed to wait for metadata server to be available. Check if the gke-metadata-server Pod in the kube-system namespace is healthy.' >&2; exit 1
  containers:
  - image: gcr.io/your-project/your-image
    name: your-main-application-container
```
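
If you prefer to wait and retry from the workload itself, for example in a container entrypoint script, the idea can be sketched as a generic retry wrapper. The function names `retry` and `fetch_token` are hypothetical, and the attempt count and delay are illustrative values, not recommendations from GKE. The token URL is the same one that the initContainer example uses.

```shell
# Run a command until it succeeds, up to max_attempts times,
# sleeping delay seconds between attempts.
retry() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Succeeds once the metadata server returns a token for the default
# service account; the token itself is discarded.
fetch_token() {
  curl -sS --fail --connect-timeout 3 -H 'Metadata-Flavor: Google' \
    'http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token' \
    > /dev/null
}
```

For example, `retry 10 2 fetch_token` tries up to 10 times with a 2-second pause before the entrypoint starts the main process.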

Workload Identity Federation for GKE fails due to control plane unavailability

When the cluster control plane is unavailable, the GKE metadata server can't return Workload Identity Federation for GKE credentials, and calls to the metadata server return status code 500.

A log entry might appear similar to the following in the Logs Explorer:

```
dial tcp 35.232.136.58:443: connect: connection refused
```

This behavior leads to unavailability of Workload Identity Federation for GKE.

The control plane might be unavailable on zonal clusters during cluster maintenance such as rotating IP addresses, upgrading control plane VMs, or resizing clusters or node pools. See Choosing a regional or zonal control plane to learn about control plane availability. Switching to a regional cluster eliminates this issue.
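
To check whether a cluster is zonal, you can look at its location: Google Cloud zone names end in a hyphen and a letter (for example, us-central1-a), and region names don't. The following sketch relies on that naming pattern. The function names are hypothetical, and the sketch assumes that the cluster name is unique within the current project.

```shell
# Return success if the location looks like a zone (ends in "-<letter>").
is_zonal_location() {
  case "$1" in
    *-[a-z]) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the location of the named cluster in the current project.
cluster_location() {
  gcloud container clusters list \
    --filter="name=$1" \
    --format="value(location)"
}
```

For example, `is_zonal_location "$(cluster_location CLUSTER_NAME)" && echo "zonal cluster"` prints a message only for zonal clusters.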

Workload Identity Federation for GKE authentication fails in clusters using Istio

If the GKE metadata server is blocked for any reason, Workload Identity Federation for GKE authentication fails.

If you are using Istio or Anthos Service Mesh, add the following Pod-level annotation to all workloads that use Workload Identity Federation for GKE to exclude the IP from redirection:

```
traffic.sidecar.istio.io/excludeOutboundIPRanges: 169.254.169.254/32
```

You can change the global.proxy.excludeIPRanges Istio ConfigMap key to do the same thing.

Alternatively, add the following Pod-level annotation to all workloads that use Workload Identity Federation for GKE to delay the start of the application container until the sidecar is ready:

```
proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
```

You can change the global.proxy.holdApplicationUntilProxyStarts Istio ConfigMap key to do the same thing.
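
For workloads managed by a Deployment, one way to add the exclusion annotation to the Pod template is a merge patch, sketched below. The function name `exclude_metadata_ip` is hypothetical, and the sketch assumes that the workload is a Deployment; other controllers need the same annotation in their own Pod templates.

```shell
# Add the Istio exclusion annotation to a Deployment's Pod template.
# Changing the Pod template triggers a rollout of new Pods.
exclude_metadata_ip() {
  local deployment="$1" namespace="${2:-default}"
  kubectl patch deployment "$deployment" \
    --namespace="$namespace" \
    --type=merge \
    --patch '{"spec":{"template":{"metadata":{"annotations":{"traffic.sidecar.istio.io/excludeOutboundIPRanges":"169.254.169.254/32"}}}}}'
}
```

For example, `exclude_metadata_ip my-app my-namespace` patches a Deployment named my-app, where both names are placeholders.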

`gke-metadata-server` Pod is crashing

The gke-metadata-server system DaemonSet Pod facilitates Workload Identity Federation for GKE on your nodes. The Pod uses memory resources proportional to the number of Kubernetes service accounts in your cluster.

The following issue occurs when the resource usage of the gke-metadata-server Pod exceeds its limits. The kubelet evicts the Pod with an out-of-memory error. You might have this issue if your cluster has more than 3,000 Kubernetes service accounts.

To identify the issue, do the following:

Find crashing gke-metadata-server Pods in the kube-system namespace:

```
kubectl get pods -n=kube-system | grep CrashLoopBackOff
```
The output is similar to the following:
```
NAMESPACE     NAME                        READY     STATUS             RESTARTS   AGE
kube-system   gke-metadata-server-8sm2l   0/1       CrashLoopBackOff   194        16h
kube-system   gke-metadata-server-hfs6l   0/1       CrashLoopBackOff   1369       111d
kube-system   gke-metadata-server-hvtzn   0/1       CrashLoopBackOff   669        111d
kube-system   gke-metadata-server-swhbb   0/1       CrashLoopBackOff   30         136m
kube-system   gke-metadata-server-x4bl4   0/1       CrashLoopBackOff   7          15m
```

Describe the crashing Pod to confirm that the cause was an out-of-memory eviction:
```
kubectl describe pod POD_NAME --namespace=kube-system | grep OOMKilled
```
Replace POD_NAME with the name of the Pod to check.

To restore functionality to the GKE metadata server, reduce the number of service accounts in your cluster to fewer than 3,000.
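
To see how close the cluster is to that number, you can count the Kubernetes service accounts and find the namespaces that hold the most. The following is a minimal sketch with hypothetical function names; it assumes that your identity can list service accounts in all namespaces.

```shell
# Print the total number of Kubernetes service accounts in the cluster.
count_service_accounts() {
  kubectl get serviceaccounts --all-namespaces --no-headers | wc -l | tr -d ' '
}

# Print "NAMESPACE COUNT" lines, largest count first.
service_accounts_by_namespace() {
  kubectl get serviceaccounts --all-namespaces --no-headers \
    | awk '{print $1}' \
    | sort \
    | uniq -c \
    | sort -rn \
    | awk '{print $2, $1}'
}
```

The per-namespace view helps you decide where to start removing unused service accounts.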

Workload Identity Federation for GKE fails to enable with DeployPatch failed error message

GKE uses the Google Cloud-managed Kubernetes Engine Service Agent to facilitate Workload Identity Federation for GKE in your clusters. Google Cloud automatically grants this service agent the Kubernetes Engine Service Agent role (roles/container.serviceAgent) on your project when you enable the Google Kubernetes Engine API.

If you try to enable Workload Identity Federation for GKE on clusters in a project where the service agent doesn't have the Kubernetes Engine Service Agent role, the operation fails with an error message similar to the following:

```
Error waiting for updating GKE cluster workload identity config: DeployPatch failed
```

To resolve this issue, try the following steps:

Check whether the service agent exists in your project and is configured correctly:
```
gcloud projects get-iam-policy PROJECT_ID \
    --flatten=bindings \
    --filter=bindings.role=roles/container.serviceAgent \
    --format="value[delimiter='\\n'](bindings.members)"
```
Replace PROJECT_ID with your Google Cloud project ID.

If the service agent is configured correctly, the output shows the full identity of the service agent:
```
serviceAccount:service-PROJECT_NUMBER@container-engine-robot.iam.gserviceaccount.com
```
If the output doesn't display the service agent, you must grant it the Kubernetes Engine Service Agent role. To grant this role, complete the following steps.

Get your Google Cloud project number:

```
gcloud projects describe PROJECT_ID \
    --format="value(projectNumber)"
```