This page shows you how to resolve issues related to security configurations in your Google Kubernetes Engine (GKE) Autopilot and Standard clusters.
If you need additional assistance, reach out to Cloud Customer Care.
RBAC and IAM
Authenticated IAM accounts fail to perform in-cluster actions
The following issue occurs when you try to perform an action in the cluster but GKE can't find an RBAC policy that authorizes the action. GKE attempts to find an IAM allow policy that grants the same permission. If that fails, you see an error message similar to the following:
Error from server (Forbidden): roles.rbac.authorization.k8s.io is forbidden:
User "example-account@example-project.iam.gserviceaccount.com" cannot list resource "roles" in
API group "rbac.authorization.k8s.io" in the namespace "kube-system": requires
one of ["container.roles.list"] permission(s).
To resolve this issue, use an RBAC policy to grant the permissions for the attempted action. For example, to resolve the issue in the previous sample, grant a Role that has the list permission on roles objects in the kube-system namespace. For instructions, see Authorize actions in clusters using role-based access control.
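For example, the following kubectl commands sketch one way to grant that access; the Role and RoleBinding names (roles-reader, roles-reader-binding) are hypothetical, and the user email comes from the sample error message:
# Create a Role in the kube-system namespace that allows listing Roles.
kubectl create role roles-reader \
    --namespace=kube-system \
    --verb=list \
    --resource=roles.rbac.authorization.k8s.io

# Bind the Role to the IAM account from the error message.
kubectl create rolebinding roles-reader-binding \
    --namespace=kube-system \
    --role=roles-reader \
    --user=example-account@example-project.iam.gserviceaccount.com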
Workload Identity Federation for GKE
Pod can't authenticate to Google Cloud
If your application can't authenticate to Google Cloud, make sure that the following settings are configured properly:
Check that you have enabled the IAM Service Account Credentials API in the project containing the GKE cluster.
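If the API isn't enabled, one way to enable it is with gcloud; PROJECT_ID is a placeholder for the ID of the project that contains the cluster:
gcloud services enable iamcredentials.googleapis.com \
    --project=PROJECT_ID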
Confirm that Workload Identity Federation for GKE is enabled on the cluster by verifying that it has a workload identity pool set:
gcloud container clusters describe CLUSTER_NAME \
    --format="value(workloadIdentityConfig.workloadPool)"
Replace CLUSTER_NAME with the name of your GKE cluster. If you haven't already specified a default zone or region for gcloud, you might also need to specify a --region or --zone flag when running this command.
Make sure that the GKE metadata server is configured on the node pool where your application is running:
gcloud container node-pools describe NODEPOOL_NAME \
    --cluster=CLUSTER_NAME \
    --format="value(config.workloadMetadataConfig.mode)"
Replace the following:
- NODEPOOL_NAME: the name of your node pool.
- CLUSTER_NAME: the name of your GKE cluster.
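The expected output is GKE_METADATA. If it isn't, one possible way to enable the GKE metadata server on an existing node pool is the following; note that changing this setting typically re-creates the nodes in the node pool:
gcloud container node-pools update NODEPOOL_NAME \
    --cluster=CLUSTER_NAME \
    --workload-metadata=GKE_METADATA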
Verify that the Kubernetes service account is annotated correctly:
kubectl describe serviceaccount \
    --namespace NAMESPACE KSA_NAME
Replace the following:
- NAMESPACE: the Kubernetes namespace that contains the service account.
- KSA_NAME: the name of your Kubernetes service account.
The expected output contains an annotation similar to the following:
iam.gke.io/gcp-service-account: GSA_NAME@PROJECT_ID.iam.gserviceaccount.com
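If the annotation is missing, one way to add it is with kubectl annotate, using the same placeholders as above:
kubectl annotate serviceaccount KSA_NAME \
    --namespace NAMESPACE \
    iam.gke.io/gcp-service-account=GSA_NAME@PROJECT_ID.iam.gserviceaccount.com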
Check that the IAM service account is configured correctly:
gcloud iam service-accounts get-iam-policy \
    GSA_NAME@GSA_PROJECT.iam.gserviceaccount.com
The expected output contains a binding similar to the following:
- members:
  - serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]
  role: roles/iam.workloadIdentityUser
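If the binding is missing, one way to add it is the following; the placeholders match the earlier steps:
gcloud iam service-accounts add-iam-policy-binding \
    GSA_NAME@GSA_PROJECT.iam.gserviceaccount.com \
    --role=roles/iam.workloadIdentityUser \
    --member="serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]"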
If you have a cluster network policy, you must allow egress to 127.0.0.1/32 on port 988 for clusters running GKE versions prior to 1.21.0-gke.1000, or to 169.254.169.252/32 on port 988 for clusters running GKE version 1.21.0-gke.1000 and later. For clusters running GKE Dataplane V2, you must allow egress to 169.254.169.254/32 on port 80.
To check the configured egress rules, describe your network policy:
kubectl describe networkpolicy NETWORK_POLICY_NAME
Replace NETWORK_POLICY_NAME with the name of your GKE network policy.
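For example, the following NetworkPolicy sketch applies to a cluster running GKE Dataplane V2 and assumes your workload Pods carry the hypothetical label app: my-app; it adds the required egress to the metadata server alongside any other egress rules you define:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gke-metadata-server-egress  # hypothetical policy name
  namespace: NAMESPACE
spec:
  podSelector:
    matchLabels:
      app: my-app  # hypothetical workload label
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 169.254.169.254/32
    ports:
    - protocol: TCP
      port: 80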
IAM service account access denied
Pods might fail to access a resource with Workload Identity Federation for GKE immediately after adding IAM role bindings. Access failure is more likely to occur in deployment pipelines or in declarative Google Cloud configurations where resources like IAM allow policies, role bindings, and Kubernetes Pods are created together. The following error message appears in the Pod logs:
HTTP/403: generic::permission_denied: loading: GenerateAccessToken("SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com", ""): googleapi: Error 403: Permission 'iam.serviceAccounts.getAccessToken' denied on resource (or it may not exist).
This error might be caused by access change propagation in IAM, which means that access changes like role grants take time to propagate across the system. For role grants, propagation usually takes about two minutes, but can sometimes take seven or more minutes. For more details, see Access change propagation.
To resolve this error, consider adding a delay before your Pods attempt to access Google Cloud resources after being created.
DNS resolution issues
Some Google Cloud client libraries are configured to connect to the GKE and Compute Engine metadata servers by resolving the DNS name metadata.google.internal; for these libraries, healthy in-cluster DNS resolution is a critical dependency for your workloads to authenticate to Google Cloud services.
How you detect this problem depends on the details of your deployed application, including its logging configuration. Look for error messages that:
- tell you to configure GOOGLE_APPLICATION_CREDENTIALS, or
- tell you that your requests to a Google Cloud service were rejected because the request had no credentials.
If you encounter problems with DNS resolution of metadata.google.internal, some Google Cloud client libraries can be instructed to skip DNS resolution by setting the environment variable GCE_METADATA_HOST to 169.254.169.254:
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  namespace: default
spec:
  containers:
  - image: debian
    name: main
    command: ["sleep", "infinity"]
    env:
    - name: GCE_METADATA_HOST
      value: "169.254.169.254"
This is the hardcoded IP address at which the metadata service is always available on Google Cloud compute platforms.
Supported Google Cloud libraries:
- Python
- Java
- Node.js
- Go (note, however, that the Go client library already prefers to connect by IP address rather than by DNS name).
Timeout errors at Pod startup
The GKE metadata server needs a few seconds before it can start accepting requests on a new Pod. Attempts to authenticate using Workload Identity Federation for GKE within the first few seconds of a Pod's life might fail for applications and Google Cloud client libraries configured with a short timeout.
If you encounter timeout errors, try the following:
- Update the Google Cloud client libraries that your workloads use.
- Change the application code to wait a few seconds and retry.
- Deploy an initContainer that waits until the GKE metadata server is ready before running the Pod's main container.
For example, the following manifest is for a Pod with an initContainer:
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-initcontainer
spec:
  serviceAccountName: KSA_NAME
  initContainers:
  - image: gcr.io/google.com/cloudsdktool/cloud-sdk:alpine
    name: workload-identity-initcontainer
    command:
    - '/bin/bash'
    - '-c'
    - |
      curl -sS -H 'Metadata-Flavor: Google' 'http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token' --retry 30 --retry-connrefused --retry-max-time 60 --connect-timeout 3 --fail --retry-all-errors > /dev/null && exit 0 || echo 'Retry limit exceeded. Failed to wait for metadata server to be available. Check if the gke-metadata-server Pod in the kube-system namespace is healthy.' >&2; exit 1
  containers:
  - image: gcr.io/your-project/your-image
    name: your-main-application-container
Workload Identity Federation for GKE fails due to control plane unavailability
The metadata server can't serve requests for Workload Identity Federation for GKE when the cluster control plane is unavailable. Calls to the metadata server return status code 500.
A log entry might appear similar to the following in the Logs Explorer:
dial tcp 35.232.136.58:443: connect: connection refused
This behavior leads to unavailability of Workload Identity Federation for GKE.
The control plane might be unavailable on zonal clusters during cluster maintenance like rotating IPs, upgrading control plane VMs, or resizing clusters or node pools. See Choosing a regional or zonal control plane to learn about control plane availability. Switching to a regional cluster eliminates this issue.
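One quick way to check whether your cluster is zonal is to look at its location; a zone-shaped value such as us-central1-a indicates a zonal control plane, while a region such as us-central1 indicates a regional one:
gcloud container clusters describe CLUSTER_NAME \
    --format="value(location)"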
Workload Identity Federation for GKE authentication fails in clusters using Istio
You might see errors similar to the following when your application starts and tries to communicate with an endpoint:
Connection refused (169.254.169.254:80)
Connection timeout
These errors can occur when your application tries to make a network connection before the istio-proxy container is ready. By default, Istio and Cloud Service Mesh allow workloads to send requests as soon as the workloads start, regardless of whether the service mesh proxy workload that intercepts and redirects traffic is running. For Pods that use Workload Identity Federation for GKE, these initial requests that happen before the proxy starts might not reach the GKE metadata server. As a result, authentication to Google Cloud APIs fails. If you don't configure your applications to retry the requests, your workloads might fail.
To confirm that this issue is the cause of your errors, view your logs and check if the istio-proxy container has started successfully:
In the Google Cloud console, go to the Logs Explorer page.
In the query pane, enter the following query:
(resource.type="k8s_container"
resource.labels.pod_name="POD_NAME"
textPayload:"Envoy proxy is ready" OR textPayload:"ERROR_MESSAGE")
OR
(resource.type="k8s_pod"
logName:"events"
jsonPayload.involvedObject.name="POD_NAME")
Replace the following:
- POD_NAME: the name of the Pod with the affected workload.
- ERROR_MESSAGE: the error that the application received (either connection timeout or connection refused).
Click Run query.
Review the output and check when the istio-proxy container became ready.
In the following example, the application tried to make a gRPC call. However, because the istio-proxy container was still initializing, the application received a Connection refused error. The timestamp next to the Envoy proxy is ready message indicates when the istio-proxy container became ready for connection requests:
2024-11-11T18:37:03Z started container istio-init
2024-11-11T18:37:12Z started container gcs-fetch
2024-11-11T18:37:42Z Initializing environment
2024-11-11T18:37:55Z Started container istio-proxy
2024-11-11T18:38:06Z StatusCode="Unavailable", Detail="Error starting gRPC call. HttpRequestException: Connection refused (169.254.169.254:80)
2024-11-11T18:38:13Z Envoy proxy is ready
To resolve this issue and prevent it from recurring, try either of the following per-workload configuration options (the sketch after this list shows where these annotations go in a Pod manifest):
- Prevent your applications from sending requests until the proxy workload is ready. Add the following annotation to the metadata.annotations field in your Pod specification:
proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
- Configure Istio or Cloud Service Mesh to exclude the IP address of the GKE metadata server from redirection. Add the following annotation to the metadata.annotations field of your Pod specification:
traffic.sidecar.istio.io/excludeOutboundIPRanges: 169.254.169.254/32
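The following minimal Pod sketch shows where these annotations sit; the Pod name is a placeholder, and you would normally set only one of the two annotations:
apiVersion: v1
kind: Pod
metadata:
  name: example-workload  # placeholder name
  annotations:
    # Option 1: hold application startup until the istio-proxy container is ready.
    # proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
    # Option 2: bypass the sidecar for traffic to the GKE metadata server.
    traffic.sidecar.istio.io/excludeOutboundIPRanges: 169.254.169.254/32
spec:
  serviceAccountName: KSA_NAME
  containers:
  - name: your-main-application-container
    image: gcr.io/your-project/your-image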
In open source Istio, you can optionally mitigate this issue for all Pods by setting one of the following global configuration options (one possible way to apply them is sketched after this list):
- Exclude the GKE metadata server IP address from redirection: update the global.proxy.excludeIPRanges global configuration option to add the 169.254.169.254/32 IP address range.
- Prevent applications from sending requests until the proxy starts: add the global.proxy.holdApplicationUntilProxyStarts global configuration option with a value of true to your Istio configuration.
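For example, if you installed open source Istio with istioctl, one possible way to apply these values is the following sketch; adapt it to however you manage your Istio installation, and set only the option you need:
istioctl install \
    --set values.global.proxy.excludeIPRanges="169.254.169.254/32" \
    --set values.global.proxy.holdApplicationUntilProxyStarts=true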
gke-metadata-server Pod is crashing
The gke-metadata-server system DaemonSet Pod facilitates Workload Identity Federation for GKE on your nodes. The Pod uses memory resources proportional to the number of Kubernetes service accounts in your cluster.
The following issue occurs when the resource usage of the gke-metadata-server Pod exceeds its limits. The kubelet evicts the Pod with an out of memory error. You might have this issue if your cluster has more than 3,000 Kubernetes service accounts.
To identify the issue, do the following:
Find crashing gke-metadata-server Pods in the kube-system namespace:
kubectl get pods -n=kube-system | grep CrashLoopBackOff
The output is similar to the following:
NAMESPACE     NAME                        READY  STATUS            RESTARTS  AGE
kube-system   gke-metadata-server-8sm2l   0/1    CrashLoopBackOff  194       16h
kube-system   gke-metadata-server-hfs6l   0/1    CrashLoopBackOff  1369      111d
kube-system   gke-metadata-server-hvtzn   0/1    CrashLoopBackOff  669       111d
kube-system   gke-metadata-server-swhbb   0/1    CrashLoopBackOff  30        136m
kube-system   gke-metadata-server-x4bl4   0/1    CrashLoopBackOff  7         15m
Describe the crashing Pod to confirm that the cause was an out-of-memory eviction:
kubectl describe pod POD_NAME --namespace=kube-system | grep OOMKilled
Replace POD_NAME with the name of the Pod to check.
To restore functionality to the GKE metadata server, reduce the number of service accounts in your cluster to less than 3,000.
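To check how close you are to that threshold, you can count the Kubernetes service accounts across all namespaces, for example:
kubectl get serviceaccounts --all-namespaces --no-headers | wc -l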
Workload Identity Federation for GKE fails to enable with DeployPatch failed error message
GKE uses the Google Cloud-managed Kubernetes Engine Service Agent to facilitate Workload Identity Federation for GKE in your clusters. Google Cloud automatically grants this service agent the Kubernetes Engine Service Agent role (roles/container.serviceAgent) on your project when you enable the Google Kubernetes Engine API.
If you try to enable Workload Identity Federation for GKE on clusters in a project where the service agent doesn't have the Kubernetes Engine Service Agent role, the operation fails with an error message similar to the following:
Error waiting for updating GKE cluster workload identity config: DeployPatch failed
To resolve this issue, try the following steps:
Check whether the service agent exists in your project and is configured correctly:
gcloud projects get-iam-policy PROJECT_ID \
    --flatten=bindings \
    --filter=bindings.role=roles/container.serviceAgent \
    --format="value[delimiter='\\n'](bindings.members)"
Replace PROJECT_ID with your Google Cloud project ID.
If the service agent is configured correctly, the output shows the full identity of the service agent:
serviceAccount:service-PROJECT_NUMBER@container-engine-robot.iam.gserviceaccount.com
If the output doesn't display the service agent, you must grant it the Kubernetes Engine Service Agent role. To grant this role, complete the following steps.
Get your Google Cloud project number:
gcloud projects describe PROJECT_ID \
    --format="value(projectNumber)"
The output is similar to the following:
123456789012
Grant the service agent the role:
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member=serviceAccount:service-PROJECT_NUMBER@container-engine-robot.iam.gserviceaccount.com \
    --role=roles/container.serviceAgent \
    --condition=None
Replace PROJECT_NUMBER with your Google Cloud project number.
Try to enable Workload Identity Federation for GKE again.
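For example, one way to enable Workload Identity Federation for GKE on an existing cluster is to set the workload pool; CLUSTER_NAME and PROJECT_ID are placeholders for your cluster name and project ID:
gcloud container clusters update CLUSTER_NAME \
    --workload-pool=PROJECT_ID.svc.id.goog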