Troubleshooting

Collecting Connect Agent logs

The Connect Agent is a Deployment, gke-connect-agent, that connects clusters to Google. It's typically installed in your cluster in the namespace gke-connect. Collecting logs from this Connect Agent can be useful for troubleshooting issues.

You can retrieve the Agent's logs by running the following command:

for ns in $(kubectl get ns -o jsonpath={.items..metadata.name} -l hub.gke.io/project); do
  echo "======= Logs $ns ======="
  kubectl logs -n $ns -l app=gke-connect-agent
done

To get information about every Connect Agent running in your project clusters:

kubectl describe deployment --all-namespaces -l app=gke-connect-agent

A successful connection should have entries similar to the following example:

2019/02/16 17:28:43.312056 dialer.go:244: dialer: dial: connected to gkeconnect.googleapis.com:443
2019/02/16 17:28:43.312279 tunnel.go:234: serve: opening egress stream...
2019/02/16 17:28:43.312439 tunnel.go:248: serve: registering endpoint="442223602236", shard="88d2bca5-f40a-11e8-983e-42010a8000b2" {"Params":{"GkeConnect":{"endpoint_class":1,"metadata":{"Metadata":{"Default":{"manifest_version":"234227867"}}}}}} ...
2019/02/16 17:28:43.312656 tunnel.go:259: serve: serving requests...

tls: oversized record errors

Symptom

You might encounter an error like this one:

... dialer: dial: connection to gkeconnect.googleapis.com:443 failed after
388.080605ms: serve: egress call failed: rpc error: code = Unauthenticated
desc = transport: oauth2: cannot fetch token: Post
https://oauth2.googleapis.com/token: proxyconnect tcp: tls: oversized record
received with length 20527

Possible causes

This can mean that the Connect Agent is trying to connect via HTTPS to an HTTP-only proxy. The Connect Agent only supports CONNECT-based HTTP proxies.

Resolution

You need to reconfigure your proxy environment variables to the following:

http_proxy=http://[PROXY_URL]:[PROXY_PORT]
https_proxy=http://[PROXY_URL]:[PROXY_PORT]

oauth2: cannot fetch token errors

Symptom

You might encounter an error like this one:

...  dialer: dial: connection to gkeconnect.googleapis.com:443 failed
after 388.080605ms: serve: egress call failed: rpc error: code =
Unauthenticated desc = transport: oauth2: cannot fetch token: Post
https://oauth2.googleapis.com/token: read tcp 192.168.1.40:5570->1.1.1.1:80
read: connection reset by peer

Possible causes

This can mean that the upstream HTTP proxy reset the connection, most likely because this particular URL is not allowed by your HTTP proxy. In the example above, 1.1.1.1:80 is the HTTP proxy address.

Resolution

Check that your HTTP proxy allowlist includes the following URLs/domains:

gkeconnect.googleapis.com
oauth2.googleapis.com/token
www.googleapis.com/oauth2/v1/certs
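
If you are unsure whether the proxy allows these endpoints, one quick check is to send a request through the proxy from a machine on the same network. This is only a sketch; substitute your actual proxy address and port:

https_proxy=http://[PROXY_URL]:[PROXY_PORT] curl -v https://oauth2.googleapis.com/token

If the proxy allows the CONNECT, you should see a TLS handshake and an HTTP response from Google (an HTTP error status is expected, since no credentials are posted); a connection reset or proxy error suggests the URL is still blocked.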

PermissionDenied errors

Symptom

You might encounter an error like this one:

tunnel.go:250: serve: recv error: rpc error: code = PermissionDenied
desc = The caller does not have permission
dialer.go:210: dialer: dial: connection to gkeconnect.googleapis.com:443
failed after 335.153278ms: serve: receive request failed: rpc error:
code = PermissionDenied desc = The caller does not have permission
dialer.go:150: dialer: connection done: serve: receive request failed:
rpc error: code = PermissionDenied desc = The caller does not have permission
dialer.go:228: dialer: backoff: 1m14.1376766s

Possible causes

This can mean that you have not bound the required Identity and Access Management (IAM) role to the Google Cloud service account you created to authorize the Connect Agent to connect to Google. The Google Cloud service account requires the gkehub.connect IAM role.

This can also happen if you delete and recreate the Google Cloud service account with the same name. You also need to delete and recreate the IAM role binding in that case. Refer to Deleting and recreating service accounts for more information.

Resolution

Bind the gkehub.connect role to your service account (note that the gkehub.admin role does not have the proper permissions for connecting and is not meant to be used by service accounts).

For example, for a project called my-project and a Google Cloud service account called gkeconnect@my-project.iam.gserviceaccount.com, you'd run the following command to bind the role to the service account:

gcloud projects add-iam-policy-binding my-project \
  --member "serviceAccount:gkeconnect@my-project.iam.gserviceaccount.com" \
  --role "roles/gkehub.connect"

You can verify that the role binding has been applied by examining the output of the following command; you should see roles/gkehub.connect bound to the associated Google Cloud service account.

gcloud projects get-iam-policy my-project
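
To narrow the output to just the roles bound to that service account, you can filter the policy. This is a sketch using the example project and service account from above:

gcloud projects get-iam-policy my-project \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:gkeconnect@my-project.iam.gserviceaccount.com" \
  --format="table(bindings.role)"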

Error when binding IAM role to the Google Cloud service account

Symptom

You might encounter an error like this one:

ERROR: (gcloud.projects.add-iam-policy-binding) PERMISSION_DENIED:
Service Management API has not been used in project [PROJECT_ID] before or it
is disabled. Enable it by visiting
https://console.developers.google.com/apis/api/servicemanagement.googleapis.com/overview?project=[PROJECT_ID]
then retry. If you enabled this API recently, wait a few minutes for the
action to propagate to our systems and retry.

Possible causes

You might not have the IAM permissions to run the command gcloud projects add-iam-policy-binding.

Resolution

You need to have the resourcemanager.projects.setIamPolicy permission. If you have the Project IAM Admin, Owner, or Editor role, you should be able to run the command. If an internal security policy prohibits you from running the command, check with your administrator.

Error from invalid Service Account key

Symptom

You might encounter an error like this one:

2020/05/08 01:22:21.435104 environment.go:214: Got ExternalID 3770f509-b89b-48c4-96e0-860bb70b3a58 from namespace kube-system.
2020/05/08 01:22:21.437976 environment.go:485: Using GCP Service Account key
2020/05/08 01:22:21.438140 gkeconnect_agent.go:50: error creating kubernetes connect agent: failed to get tunnel config: unexpected end of JSON input

Possible causes

These logs indicate that the Connect Agent was provided with an invalid Service Account key during installation.

Resolution

Create a new JSON file containing Service Account credentials and then follow the steps to Register a cluster again to reinstall the Connect Agent.
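
For example, assuming the registration service account is gkeconnect@my-project.iam.gserviceaccount.com as in the earlier examples, a sketch for generating a fresh key file (the file name is arbitrary) is:

gcloud iam service-accounts keys create connect-key.json \
  --iam-account=gkeconnect@my-project.iam.gserviceaccount.com

Pass the new connect-key.json file when you re-register the cluster.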

Error from expired Service Account key

Symptom

You might encounter an error like this one:

2020/05/08 01:22:21.435104 dialer.go:277: dialer: dial: connection to gkeconnect.googleapis.com:443 failed after 37.901608ms:
serve: egress call failed: rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: 400 Bad Request
Response: {"error":"invalid_grant","error_description":"Invalid JWT Signature."}

Possible causes

These logs indicate that the Connect Agent was dialing Connect with a problematic Service Account key that the Google authentication service does not accept. The service account key file might be corrupted, or the key might have expired.

To check whether the key has expired, you can find the key expiration date on the service account's details page in the Google Cloud Console. In some cases, your project or organization might have a policy that limits service account keys to a short lifetime by default.
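
You can also list the account's keys from the command line to check their validity windows. This is a sketch using the example service account from the earlier sections:

gcloud iam service-accounts keys list \
  --iam-account=gkeconnect@my-project.iam.gserviceaccount.com

The output includes each key's creation and expiration timestamps.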

Resolution

Create a new JSON file containing Service Account credentials and then follow the steps to Register a cluster again to reinstall the Connect Agent.

Error from skewed system clock

Symptom

You might encounter an error like this one:

acceptCall: failed to parse token in req [rpc_id=1]: Token used before issued [rpc_id=1]

Possible causes

The log message usually indicates that there is a clock skew on the cluster. The cluster-issued token has an out-of-sync timestamp and thus the token is rejected.

Resolution

To check whether the clock is properly synced, you can run the date command on your cluster and compare it with the standard time. Usually a drift of a few seconds causes this problem. To resolve this issue, re-sync your cluster's clock.
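
As a rough check, you can compare a node's clock against an external reference, for example the Date header returned by www.googleapis.com. This is only a sketch:

date -u
curl -sI https://www.googleapis.com | grep -i '^date:'

A difference of more than a few seconds indicates the skew described above.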

Unable to see workloads in Cloud Console

Symptoms

In the Connect Agent logs, you might observe the following errors:

"https://10.0.10.6:443/api/v1/nodes" YYYY-MM-DDTHH mm:ss.sssZ http.go:86: GET
"https://10.0.10.6:443/api/v1/pods" YYYY-MM-DDTHH mm:ss.sssZ http.go:139:
Response status: "403 Forbidden" YYYY-MM-DDTHH mm:ss.sssZ http.go:139:
Response status: "403 Forbidden"`
Possible causes

These logs indicate that Google Cloud is attempting to access the cluster using the credentials you provided during registration. 403 errors indicate that the credentials don't have the permissions required to access the cluster.

Resolution

Check the token and the account it's bound to, and make sure the account has the appropriate permissions on the cluster.
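
For example, if the registration credentials are bound to a Kubernetes service account, a sketch for checking its access is the following (replace NAMESPACE and SERVICE_ACCOUNT with the values you used during registration):

kubectl auth can-i list nodes --as=system:serviceaccount:NAMESPACE/SERVICE_ACCOUNT
kubectl auth can-i list pods --all-namespaces --as=system:serviceaccount:NAMESPACE/SERVICE_ACCOUNT

Both commands should print "yes"; a "no" is consistent with the 403 responses above.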

Context deadline exceeded

Symptom

You might encounter an error like this one:

2019/03/06 21:08:43.306625 dialer.go:235: dialer: dial: connecting to gkeconnect.googleapis.com:443...
2019/03/06 21:09:13.306893 dialer.go:240: dialer: dial: unable to connect to gkeconnect.googleapis.com:443: context deadline exceeded
2019/03/06 21:09:13.306943 dialer.go:183: dialer: connection done: context deadline exceeded

Possible causes

This error indicates a low-level TCP networking issue where the Connect Agent cannot talk with gkeconnect.googleapis.com.

Resolution

Verify that Pod workloads within this cluster can resolve and have outbound connectivity to gkeconnect.googleapis.com on port 443.
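
One way to test this is to run a short-lived Pod and attempt the connection from inside the cluster. This is only a sketch; the curlimages/curl image is just one example of a minimal image that includes curl:

kubectl run connect-test --rm -it --restart=Never \
  --image=curlimages/curl -- -v https://gkeconnect.googleapis.com:443

A completed TLS handshake (even if the HTTP response itself is an error) shows that DNS resolution and outbound connectivity on port 443 work; a timeout points to a DNS, firewall, or routing problem.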

Agent connection fails intermittently

Symptoms

In the Connect Agent logs, you might observe the following errors:

2020/10/06 18:02:34.409749 dialer.go:277: dialer: dial: connection to gkeconnect.googleapis.com:443 failed after 8m0.790286282s: serve: receive request failed: rpc error: code = Unavailable desc = transport is closing
2020/10/06 18:02:34.416618 dialer.go:207: dialer: connection done: serve: receive request failed: rpc error: code = Unavailable desc = transport is closing
2020/10/06 18:02:34.416722 dialer.go:295: dialer: backoff: 978.11948ms
2020/10/06 18:02:34.410097 tunnel.go:651: sendResponse: EOF [rpc_id=52]
2020/10/06 18:02:34.420077 tunnel.go:651: sendResponse: EOF [rpc_id=52]
2020/10/06 18:02:34.420204 tunnel.go:670: sendHalfClose: EOF [rpc_id=52]
2020/10/06 18:02:34.401412 tunnel.go:670: sendHalfClose: EOF [rpc_id=53]

Possible causes

The connection to Connect closes when the Connect Agent does not have enough resources, for example on smaller AWS EC2 instances such as t3.medium.

Resolution

Enable T3 unlimited or use an instance type with more resources for your node pools.
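
If metrics-server is installed in the cluster, a sketch for checking whether the agent is under resource pressure is:

kubectl top pod --all-namespaces -l app=gke-connect-agent

Sustained CPU or memory usage near the Pod's limits supports moving to a larger instance type.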

Hub cannot access the project

Symptoms

During some Hub operations (usually cluster registration) you might observe an error similar to the following:

ERROR: (gcloud.container.hub.memberships.register) failed to initialize Default Feature
"authorizer", the Hub service account may not have access to your project

Possible causes

Hub's default service account, gcp-sa-gkehub, can accidentally become unbound from a project. The Hub Service Agent is an IAM role that grants the service account the permissions to manage cluster resources. If you remove this role binding from the service account, the default service account becomes unbound from the project, which can prevent you from registering clusters and performing other cluster operations.

You can check whether the service account has been removed from your project using the gcloud tool or the Cloud Console. If the command or the dashboard does not display gcp-sa-gkehub among your service accounts, the service account has become unbound.

gcloud

Run the following command:

gcloud projects get-iam-policy PROJECT_NAME

where PROJECT_NAME is the name of the project where you are trying to register the cluster.

Console

Visit the IAM & admin page in Cloud Console.

Resolution

If you removed the Hub Service Agent role binding, run the following commands to restore the role binding:

PROJECT_NUMBER=$(gcloud projects describe PROJECT_NAME --format "value(projectNumber)")
gcloud projects add-iam-policy-binding PROJECT_NAME \
  --member "serviceAccount:service-${PROJECT_NUMBER}@gcp-sa-gkehub.iam.gserviceaccount.com" \
  --role roles/gkehub.serviceAgent

To confirm that the role binding was granted:

gcloud projects get-iam-policy PROJECT_NAME

If you see the service account name along with the gkehub.serviceAgent role, the role binding has been granted. For example:

- members:
  - serviceAccount:service-1234567890@gcp-sa-gkehub.iam.gserviceaccount.com
  role: roles/gkehub.serviceAgent

Error in registering a GKE cluster from a different project than Hub

Symptoms

While registering a GKE cluster from a project different than the Hub project, you might observe an error similar to the following:

ERROR: (gcloud.container.hub.memberships.register) hub default service account
does not have access to the GKE cluster project for //container.googleapis.com/v1/projects/my-project/zones/zone-a/clusters/my-cluster

Possible causes

The Hub default Service Account does not have the required permissions in the GKE cluster's project.

Resolution

Grant the Hub default Service Account the required permissions before registering the cluster.
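
For example, mirroring the binding command shown in the previous section, a sketch for granting that access is the following, where GKE_PROJECT_NAME is the project that contains the GKE cluster and HUB_PROJECT_NAME is the project where the Hub is enabled:

HUB_PROJECT_NUMBER=$(gcloud projects describe HUB_PROJECT_NAME --format "value(projectNumber)")
gcloud projects add-iam-policy-binding GKE_PROJECT_NAME \
  --member "serviceAccount:service-${HUB_PROJECT_NUMBER}@gcp-sa-gkehub.iam.gserviceaccount.com" \
  --role roles/gkehub.serviceAgent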

Error when disabling the Hub API

Symptoms

When trying to disable the Hub API (gkehub.googleapis.com), you may encounter an error similar to the following:

Not ready to deactivate the service on this project; ensure there are no more resources managed by this service.

Possible causes

There are still clusters registered to Google Cloud (memberships) or environ-level features enabled in this project. All memberships or features must be unregistered or disabled before the API can be disabled.

  • To view your current registered clusters, follow the instructions in Viewing registered clusters

  • To see all active environ-level features for your project:

gcloud and cURL

$ curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    https://gkehub.googleapis.com/v1alpha1/projects/PROJECT_NAME/locations/global/features

where PROJECT_NAME is the name of the project where you are trying to disable the Hub API.

Cloud Console

If you have Anthos enabled in your project, visit the features overview page in the Cloud Console. Features listed as ENABLED are active environ-level features.

Resolution

First, unregister any clusters still registered to your project environ. All clusters must be unregistered before some features can be disabled.
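
For example, for a cluster that was registered with a kubeconfig context, unregistering typically looks like the following sketch (use the flags that match how the cluster was originally registered):

gcloud container hub memberships unregister MEMBERSHIP_NAME \
  --context=CLUSTER_CONTEXT \
  --kubeconfig=KUBECONFIG_PATH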

Once you have done this, disable all environ-level features. Currently, this is only possible with the Hub REST API.

  1. Disable environ-level features that you have enabled for your project

    $ gcloud alpha container hub FEATURE_COMMAND disable
    
  2. Disable feature authorizer and metering, which are enabled by default.

    $ curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        -X "DELETE" \
        https://gkehub.googleapis.com/v1alpha1/projects/PROJECT_NAME/locations/global/features/FEATURE
    

    where FEATURE is the name of the feature to disable (such as authorizer or metering).

Missing cluster permissions when registering a cluster

Symptom

While trying to register a cluster with a user account or Google Cloud service account, you may get an error similar to the following:

ERROR: (gcloud.container.hub.memberships.register) ResponseError: code=403, message=Required "container.clusters.get" permission(s) for "projects/my-project/zones/zone-a/clusters/my-cluster"

Possible causes

The account that is trying to register the cluster does not have the required cluster-admin role-based access control (RBAC) role in the cluster.

Resolution

Grant the cluster-admin RBAC role to the account before registering the cluster.
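
For example, if you are registering with a user account, a sketch for granting that account cluster-admin is the following (the binding name is arbitrary):

kubectl create clusterrolebinding connect-register-admin \
  --clusterrole=cluster-admin \
  --user=USER_EMAIL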

Getting additional help

You can file a ticket with Google Cloud Support for Connect by performing the following steps:

  1. File a case with Google Cloud Support.
  2. Follow the instructions in Collecting Connect Agent logs to save the Connect logs.
  3. Attach the Connect Agent logs to your case.