Collecting Connect Agent logs
The Connect Agent is a Deployment, gke-connect-agent, that connects clusters to Google. It's typically installed in your cluster in the namespace gke-connect. Collecting logs from this Connect Agent can be useful for troubleshooting issues.
You can retrieve the Agent's logs by running the following command:
for ns in $(kubectl get ns -o jsonpath={.items..metadata.name} -l hub.gke.io/project); do
  echo "======= Logs $ns ======="
  kubectl logs -n $ns -l app=gke-connect-agent
done
To get information about every Connect Agent running in your project clusters:
kubectl describe deployment --all-namespaces -l app=gke-connect-agent
A successful connection should have entries similar to the example below:
2019/02/16 17:28:43.312056 dialer.go:244: dialer: dial: connected to gkeconnect.googleapis.com:443
2019/02/16 17:28:43.312279 tunnel.go:234: serve: opening egress stream...
2019/02/16 17:28:43.312439 tunnel.go:248: serve: registering endpoint="442223602236", shard="88d2bca5-f40a-11e8-983e-42010a8000b2" {"Params":{"GkeConnect":{"endpoint_class":1,"metadata":{"Metadata":{"Default":{"manifest_version":"234227867"}}}}}} ...
2019/02/16 17:28:43.312656 tunnel.go:259: serve: serving requests...
tls: oversized record errors
- Symptom
You might encounter an error like this one:
... dialer: dial: connection to gkeconnect.googleapis.com:443 failed after 388.080605ms: serve: egress call failed: rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: proxyconnect tcp: tls: oversized record received with length 20527
- Possible causes
This can mean that the Connect Agent is trying to connect via HTTPS to an HTTP-only proxy. The Connect Agent only supports CONNECT-based HTTP proxies.
- Resolution
You need to reconfigure your proxy environment variables to the following:
http_proxy=http://[PROXY_URL]:[PROXY_PORT]
https_proxy=http://[PROXY_URL]:[PROXY_PORT]
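If you're unsure whether your proxy supports CONNECT, you can test it from any machine behind the same proxy. This is a rough check, not part of the official steps; it reuses the [PROXY_URL]:[PROXY_PORT] placeholders above and assumes curl is available:
# curl uses the CONNECT method when tunneling an https:// URL through an HTTP proxy.
# Seeing the tunnel established (followed by a TLS handshake) in the verbose output
# suggests a CONNECT-capable proxy; an immediate TLS error suggests an HTTP-only proxy.
curl -v -x http://[PROXY_URL]:[PROXY_PORT] https://gkeconnect.googleapis.com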
oauth2: cannot fetch token errors
- Symptom
You might encounter an error like this one:
... dialer: dial: connection to gkeconnect.googleapis.com:443 failed after 388.080605ms: serve: egress call failed: rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: read tcp 192.168.1.40:5570->1.1.1.1:80 read: connection reset by peer
- Possible causes
This can mean that the upstream HTTP proxy reset the connection, most likely because this particular URL is not allowed by your HTTP proxy. In the example above, 1.1.1.1:80 is the HTTP proxy address.
- Resolution
Check that your HTTP proxy allowlist includes the following URLs/domains:
gkeconnect.googleapis.com
oauth2.googleapis.com/token
www.googleapis.com/oauth2/v1/certs
PermissionDenied errors
- Symptom
You might encounter an error like this one:
tunnel.go:250: serve: recv error: rpc error: code = PermissionDenied desc = The caller does not have permission
dialer.go:210: dialer: dial: connection to gkeconnect.googleapis.com:443 failed after 335.153278ms: serve: receive request failed: rpc error: code = PermissionDenied desc = The caller does not have permission
dialer.go:150: dialer: connection done: serve: receive request failed: rpc error: code = PermissionDenied desc = The caller does not have permission
dialer.go:228: dialer: backoff: 1m14.1376766s
- Possible causes
This can mean that you have not bound the required Identity and Access Management (IAM) role to the Google Cloud service account you created to authorize the Connect Agent to connect to Google. The Google Cloud service account requires the gkehub.connect IAM role. This can also happen if you delete and recreate the Google Cloud service account with the same name; in that case, you also need to delete and recreate the IAM role binding. Refer to Deleting and recreating service accounts for more information.
- Resolution
Bind the gkehub.connect role to your service account (note that the gkehub.admin role does not have the proper permissions for connecting and is not meant to be used by service accounts). For example, for a project called my-project and a Google Cloud service account called gkeconnect@my-project.iam.gserviceaccount.com, you'd run the following command to bind the role to the service account:
gcloud projects add-iam-policy-binding my-project --member \
serviceAccount:gkeconnect@my-project.iam.gserviceaccount.com \
--role "roles/gkehub.connect"
You can verify that the role binding has been applied to the Google Cloud service account by examining the output of the following command; you should see role: roles/gkehub.connect bound to the associated Google Cloud service account:
gcloud projects get-iam-policy my-project
Error when binding IAM role to the Google Cloud service account
- Symptom
You might encounter an error like this one:
ERROR: (gcloud.projects.add-iam-policy-binding) PERMISSION_DENIED: Service Management API has not been used in project [PROJECT_ID] before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/servicemanagement.googleapis.com/overview?project=[PROJECT_ID] then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.
- Possible causes
You might not have the IAM permissions to run the command gcloud projects add-iam-policy-binding.
- Resolution
You need to have the permission resourcemanager.projects.setIamPolicy. If you have the Project IAM Admin, Owner, or Editor roles, you should be able to run the command. If an internal security policy prohibits you from running the command, check with your administrator.
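To see which roles your own account currently holds on the project, you can filter the IAM policy for your identity; a sketch where [PROJECT_ID] and YOUR_EMAIL are placeholders:
# List the roles bound to your user account in the project.
gcloud projects get-iam-policy [PROJECT_ID] \
  --flatten="bindings[].members" \
  --filter="bindings.members:user:YOUR_EMAIL" \
  --format="value(bindings.role)"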
Error from invalid Service Account key
- Symptom
You might encounter an error like this one:
2020/05/08 01:22:21.435104 environment.go:214: Got ExternalID 3770f509-b89b-48c4-96e0-860bb70b3a58 from namespace kube-system.
2020/05/08 01:22:21.437976 environment.go:485: Using GCP Service Account key
2020/05/08 01:22:21.438140 gkeconnect_agent.go:50: error creating kubernetes connect agent: failed to get tunnel config: unexpected end of JSON input
- Possible causes
These logs indicate that the Connect Agent was provided with an invalid Service Account key during installation.
- Resolution
Create a new JSON file containing Service Account credentials and then follow the steps to Register a cluster again to reinstall the Connect Agent.
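For example, assuming the service account name used earlier on this page (gkeconnect@my-project.iam.gserviceaccount.com; substitute your own), you could generate a fresh key file with:
# Create a new JSON key for the Connect service account; the output path is arbitrary.
gcloud iam service-accounts keys create /tmp/creds-gcp.json \
  --iam-account=gkeconnect@my-project.iam.gserviceaccount.com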
Error from expired Service Account key
- Symptom
You might encounter an error like this one:
2020/05/08 01:22:21.435104 dialer.go:277: dialer: dial: connection to gkeconnect.googleapis.com:443 failed after 37.901608ms: serve: egress call failed: rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: 400 Bad Request Response: {"error":"invalid_grant","error_description":"Invalid JWT Signature."}
- Possible causes
These logs indicate that the Connect Agent was dialing Connect with a problematic Service Account key that is not accepted by the Google authentication service. The Service Account key file might be corrupted, or the key might have expired.
To check whether the key has expired, you can find the key expiration date on the service account's detail page in the Google Cloud Console. In some cases, your project or organization might have a policy that limits service account keys to a short lifetime by default.
- Resolution
Create a new JSON file containing Service Account credentials and then follow the steps to Register a cluster again to reinstall the Connect Agent.
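Before creating a new key, you can also confirm the old key's validity window from the command line; a sketch, again assuming the example service account name:
# List key IDs with their creation and expiration timestamps.
gcloud iam service-accounts keys list \
  --iam-account=gkeconnect@my-project.iam.gserviceaccount.com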
Error from skewed system clock
- Symptom
You might encounter an error like this one:
acceptCall: failed to parse token in req [rpc_id=1]: Token used before issued [rpc_id=1]
- Possible causes
The log message usually indicates that there is a clock skew on the cluster. The cluster-issued token has an out-of-sync timestamp and thus the token is rejected.
- Resolution
To check whether the clock is properly synced, you can run the date command on your cluster and compare it with the standard time. Usually a drift of a few seconds causes this problem. To resolve this issue, re-sync your cluster's clock.
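One quick comparison, assuming curl is available on the machine whose clock you're checking; the HTTP Date header returned by Google's frontend serves as an external reference:
# The cluster machine's idea of the current time (UTC).
date -u
# The current time as reported by Google's servers (Date response header).
curl -sI https://gkeconnect.googleapis.com | grep -i '^date:'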
Unable to see workloads in Cloud Console
- Symptoms
In the Connect Agent logs, you might observe the following errors:
"https://10.0.10.6:443/api/v1/nodes" YYYY-MM-DDTHH mm:ss.sssZ http.go:86: GET "https://10.0.10.6:443/api/v1/pods" YYYY-MM-DDTHH mm:ss.sssZ http.go:139: Response status: "403 Forbidden" YYYY-MM-DDTHH mm:ss.sssZ http.go:139: Response status: "403 Forbidden"`
- Possible causes
These logs indicate that Google Cloud is attempting to access the cluster using the credentials you provided during registration. 403 errors indicate that the credentials don't have the permissions required to access the cluster.
- Resolution
Check the token and the account it's bound to, and make sure the account has the appropriate permissions on the cluster.
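One way to check from the cluster side is to impersonate the Kubernetes service account that the token belongs to; a sketch, where gke-connect and connect-agent-sa are hypothetical namespace and service account names, so substitute the ones you used during registration:
# Check whether the service account can read nodes and pods, which the console needs.
kubectl auth can-i get nodes --as=system:serviceaccount:gke-connect:connect-agent-sa
kubectl auth can-i get pods --all-namespaces --as=system:serviceaccount:gke-connect:connect-agent-sa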
Context deadline exceeded
- Symptom
You might encounter an error like this one:
2019/03/06 21:08:43.306625 dialer.go:235: dialer: dial: connecting to gkeconnect.googleapis.com:443...
2019/03/06 21:09:13.306893 dialer.go:240: dialer: dial: unable to connect to gkeconnect.googleapis.com:443: context deadline exceeded
2019/03/06 21:09:13.306943 dialer.go:183: dialer: connection done: context deadline exceeded
- Possible causes
This error indicates a low-level TCP networking issue where the Connect Agent cannot talk with gkeconnect.googleapis.com.
- Resolution
Verify that Pod workloads within this cluster can resolve and have outbound connectivity to gkeconnect.googleapis.com on port 443.
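A quick in-cluster test; this sketch assumes the cluster can pull the public curlimages/curl image:
# Launch a throwaway Pod and attempt a TLS connection to gkeconnect.googleapis.com:443.
kubectl run connect-test --rm -it --restart=Never \
  --image=curlimages/curl -- \
  curl -v --connect-timeout 10 https://gkeconnect.googleapis.com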
Agent connection fails intermittently
- Symptoms
In the Connect Agent logs, you might observe the following errors:
2020/10/06 18:02:34.409749 dialer.go:277: dialer: dial: connection to gkeconnect.googleapis.com:443 failed after 8m0.790286282s: serve: receive request failed: rpc error: code = Unavailable desc = transport is closing
2020/10/06 18:02:34.416618 dialer.go:207: dialer: connection done: serve: receive request failed: rpc error: code = Unavailable desc = transport is closing
2020/10/06 18:02:34.416722 dialer.go:295: dialer: backoff: 978.11948ms
2020/10/06 18:02:34.410097 tunnel.go:651: sendResponse: EOF [rpc_id=52]
2020/10/06 18:02:34.420077 tunnel.go:651: sendResponse: EOF [rpc_id=52]
2020/10/06 18:02:34.420204 tunnel.go:670: sendHalfClose: EOF [rpc_id=52]
2020/10/06 18:02:34.401412 tunnel.go:670: sendHalfClose: EOF [rpc_id=53]
- Possible causes
The connection to Connect closes when the Connect Agent does not have enough resources, for example on smaller AWS EC2 instances such as t3.medium.
- Resolution
Enable T3 unlimited or use an instance type with more resources for your node pools.
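To confirm that resource pressure is the cause before resizing, you can check the agent's usage; this sketch assumes metrics-server is installed in the cluster:
# CPU and memory usage of the Connect Agent Pods.
kubectl top pods --all-namespaces -l app=gke-connect-agent
# Overall node utilization, to spot starved nodes.
kubectl top nodes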
Hub cannot access the project
- Symptoms
During some Hub operations (usually cluster registration) you might observe an error similar to the following:
ERROR: (gcloud.container.hub.memberships.register) failed to initialize Default Feature "authorizer", the Hub service account may not have access to your project
- Possible causes
Hub's default service account, gcp-sa-gkehub, can accidentally become unbound from a project. The Hub Service Agent is an IAM role that grants the service account the permissions to manage cluster resources. If you remove this role binding from the service account, the default service account becomes unbound from the project, which can prevent you from registering clusters and performing other cluster operations.
You can check whether the service account has been removed from your project using the gcloud tool or the Cloud Console. If the command or the dashboard does not display gcp-sa-gkehub among your service accounts, the service account has become unbound.
gcloud
Run the following command:
gcloud projects get-iam-policy PROJECT_NAME
where PROJECT_NAME is the name of the project where you are trying to register the cluster.
Console
Visit the IAM & admin page in Cloud Console.
- Resolution
If you removed the Hub Service Agent role binding, run the following commands to restore the role binding:
PROJECT_NUMBER=$(gcloud projects describe PROJECT_NAME --format "value(projectNumber)")
gcloud projects add-iam-policy-binding PROJECT_NAME \
  --member "serviceAccount:service-${PROJECT_NUMBER}@gcp-sa-gkehub.iam.gserviceaccount.com" \
  --role roles/gkehub.serviceAgent
To confirm that the role binding was granted:
gcloud projects get-iam-policy PROJECT_NAME
If you see the service account name along with the gkehub.serviceAgent role, the role binding has been granted. For example:
- members:
  - serviceAccount:service-1234567890@gcp-sa-gkehub.iam.gserviceaccount.com
  role: roles/gkehub.serviceAgent
Error in registering a GKE cluster from a different project than Hub
- Symptoms
While registering a GKE cluster from a project different than the Hub project, you might observe an error similar to the following:
ERROR: (gcloud.container.hub.memberships.register) hub default service account does not have access to the GKE cluster project for //container.googleapis.com/v1/projects/my-project/zones/zone-a/clusters/my-cluster
- Possible causes
The Hub default Service Account does not have the required permissions in the GKE cluster's project.
- Resolution
Grant the Hub default Service Account the required permissions before registering the cluster.
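A sketch of the grant, where HUB_PROJECT_ID is a placeholder for the project that hosts the Hub and CLUSTER_PROJECT_ID for the project that contains the GKE cluster; it mirrors the role binding shown in the previous section:
# Look up the Hub project's number, then grant the Hub Service Agent role
# to the Hub default service account in the cluster's project.
HUB_PROJECT_NUMBER=$(gcloud projects describe HUB_PROJECT_ID --format "value(projectNumber)")
gcloud projects add-iam-policy-binding CLUSTER_PROJECT_ID \
  --member "serviceAccount:service-${HUB_PROJECT_NUMBER}@gcp-sa-gkehub.iam.gserviceaccount.com" \
  --role roles/gkehub.serviceAgent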
Error when disabling the Hub API
- Symptoms
When trying to disable the Hub API (gkehub.googleapis.com), you may encounter an error similar to the following:
Not ready to deactivate the service on this project; ensure there are no more resources managed by this service.
- Possible causes
There are still clusters registered to Google Cloud (memberships) or environ-level features enabled in this project. All memberships or features must be unregistered or disabled before the API can be disabled.
To view your current registered clusters, follow the instructions in Viewing registered clusters.
To see all active environ-level features for your project:
gcloud and cURL
$ curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
https://gkehub.googleapis.com/v1alpha1/projects/PROJECT_NAME/locations/global/features
where PROJECT_NAME is the name of the project where you are trying to disable the Hub API.
Cloud Console
If you have Anthos enabled in your project, visit the features overview page in Cloud Console. Features listed as ENABLED are active environ-level features.
- Resolution
First, unregister any clusters still registered to your project environ. All clusters must be unregistered before some features can be disabled.
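For reference, a sketch of unregistering a membership with gcloud; MEMBERSHIP_NAME and the kubeconfig details are placeholders, and the exact flags depend on your gcloud version and whether the cluster is a GKE cluster:
# Unregister a non-GKE cluster membership (flags vary by cluster type and gcloud version).
gcloud container hub memberships unregister MEMBERSHIP_NAME \
  --context=KUBECONFIG_CONTEXT \
  --kubeconfig=/path/to/kubeconfig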
Once you have done this, disable all environ-level features. Currently, this is only possible with the Hub REST API.
Disable environ-level features that you have enabled for your project
$ gcloud alpha container hub FEATURE_COMMAND disable
Disable the authorizer and metering features, which are enabled by default.
$ curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \ -X "DELETE" \ https://gkehub.googleapis.com/v1alpha1/projects/PROJECT_NAME/locations/global/features/FEATURE
where FEATURE is the name of the feature to disable (such as authorizer or metering).
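After all memberships and environ-level features are removed, you can retry disabling the API; a sketch using the standard services command:
# Disable the Hub API for the project.
gcloud services disable gkehub.googleapis.com --project=PROJECT_NAME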
Missing cluster permissions when registering a cluster
- Symptom:
While trying to register a cluster with a user account or Google Cloud service account, you may get an error similar to the following:
ERROR: (gcloud.container.hub.memberships.register) ResponseError: code=403, message=Required "container.clusters.get" permission(s) for "projects/my-project/zones/zone-a/clusters/my-cluster"
- Possible cause:
The account that is trying to register the cluster does not have the required cluster-admin role-based access control (RBAC) role in the cluster.
- Resolution:
Grant the cluster-admin RBAC role to the account before registering the cluster.
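A sketch of the grant with kubectl; the binding name is arbitrary and ACCOUNT_EMAIL is a placeholder for the user or Google Cloud service account performing the registration:
# Bind the cluster-admin ClusterRole to the registering account.
kubectl create clusterrolebinding connect-register-admin \
  --clusterrole=cluster-admin \
  --user=ACCOUNT_EMAIL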
Getting additional help
You can file a ticket with Google Cloud Support for Connect by performing the following steps:
- File a case with Google Cloud Support.
- Follow the instructions in Collecting Connect Agent logs to save the Connect logs.
- Attach the Connect Agent logs to your case.