Troubleshoot Config Connector
This page describes troubleshooting techniques for Config Connector and common issues that you might encounter when using the product.
Basic troubleshooting techniques
Check the version of Config Connector
Run the following command to get the installed Config Connector version, and cross-reference the release notes to verify that the running version supports the features and resources that you want to use:
kubectl get ns cnrm-system -o jsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com/version}'
Check the resource's status and events
Usually, you can determine the issue with your Config Connector resources by inspecting the state of your resources in Kubernetes. Checking a resource's status and events is particularly helpful for determining if Config Connector failed to reconcile the resource and why the reconciliation failed.
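For example, a minimal sketch, assuming a hypothetical PubSubTopic resource (substitute your own kind, name, and namespace):

# View the resource's status conditions and recent events.
kubectl describe pubsubtopic RESOURCE_NAME -n NAMESPACE

# Or print only the status conditions as YAML.
kubectl get pubsubtopic RESOURCE_NAME -n NAMESPACE -o jsonpath='{.status.conditions}'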
Check that Config Connector is running
To check that Config Connector is running, verify that all of its Pods are READY:
kubectl get pod -n cnrm-system
Example output:
NAME                                            READY   STATUS    RESTARTS   AGE
cnrm-controller-manager-0                       1/1     Running   0          1h
cnrm-deletiondefender-0                         1/1     Running   0          1h
cnrm-resource-stats-recorder-77dc8cc4b6-mgpgp   1/1     Running   0          1h
cnrm-webhook-manager-58496b66f9-pqwhz           1/1     Running   0          1h
cnrm-webhook-manager-58496b66f9-wdcn4           1/1     Running   0          1h
If you have Config Connector installed in namespaced-mode, then you will have one controller (cnrm-controller-manager) Pod for each namespace that is responsible for managing the Config Connector resources in that namespace.
You can check the status of the controller Pod responsible for a specific namespace by running:
kubectl get pod -n cnrm-system \
-l cnrm.cloud.google.com/scoped-namespace=NAMESPACE \
-l cnrm.cloud.google.com/component=cnrm-controller-manager
Replace NAMESPACE with the name of the namespace.
Check the controller logs
The controller Pod logs information and errors related to the reconciliation of Config Connector resources.
You can check the controller Pod's logs by running:
kubectl logs -n cnrm-system \
-l cnrm.cloud.google.com/component=cnrm-controller-manager \
-c manager
If you have Config Connector installed in namespaced-mode, then the previous command shows the logs of all controller Pods combined. You can check the logs of the controller Pod for a specific namespace by running:
kubectl logs -n cnrm-system \
-l cnrm.cloud.google.com/scoped-namespace=NAMESPACE \
-l cnrm.cloud.google.com/component=cnrm-controller-manager \
-c manager
Replace NAMESPACE with the name of the namespace.
Read more about how to inspect and query Config Connector's logs.
Abandon and acquire the resource
In some cases, you might need to update an immutable field in a resource. Since you can't edit immutable fields, you must abandon and then acquire the resource:
- Update the YAML configuration of the Config Connector resource and set the cnrm.cloud.google.com/deletion-policy annotation to abandon (see the example manifest after this list).
- Apply the updated YAML configuration to update the Config Connector resource's deletion policy.
- Abandon the Config Connector resource.
- Update the immutable fields that need to be changed in the YAML configuration.
- Apply the updated YAML configuration to acquire the abandoned resource.
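For example, a minimal sketch of the first step, assuming a hypothetical PubSubTopic named example-topic; the relevant part is the deletion-policy annotation:

apiVersion: pubsub.cnrm.cloud.google.com/v1beta1
kind: PubSubTopic
metadata:
  name: example-topic
  annotations:
    # Tells Config Connector to leave the underlying Google Cloud resource
    # in place when the Kubernetes object is deleted.
    cnrm.cloud.google.com/deletion-policy: abandon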
Common issues
Resource keeps updating every 5-15 mins
If your Config Connector resource keeps switching from an UpToDate status to an Updating status every 5-10 minutes, then it is likely that Config Connector is detecting unintentional diffs between the resource's desired state and actual state, thereby causing Config Connector to constantly update the resource.
First, confirm that you do not have any external systems that are constantly modifying either the Config Connector resource or the Google Cloud resource (for example, CI/CD pipelines, custom controllers, or cron jobs).
If the behavior is not due to an external system, see if Google Cloud is changing any of the values specified in your Config Connector resource. For example, in some cases, Google Cloud changes the formatting (for example, capitalization) of field values which leads to a diff between your resource's desired state and actual state.
Get the state of the Google Cloud resource using the REST API (for example, for ContainerCluster) or the Google Cloud CLI. Then, compare that state against your Config Connector resource. Look for any fields whose values do not match, then update your Config Connector resource to match. In particular, look for any values that were reformatted by Google Cloud. For example, see GitHub issues #578 and #294.
Note that this is not a perfect method since the Config Connector and Google Cloud resource models are different, but it should let you catch most cases of unintended diffs.
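For example, a minimal sketch assuming a hypothetical ContainerCluster resource named my-cluster in us-central1 (names, location, and namespace are placeholders):

# Fetch the live state from Google Cloud and the desired state from Kubernetes,
# then compare the fields you manage (look for reformatted values).
gcloud container clusters describe my-cluster --location us-central1 --format yaml > live-state.yaml
kubectl get containercluster my-cluster -n NAMESPACE -o yaml > desired-state.yaml
diff live-state.yaml desired-state.yaml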
If you are unable to resolve your issue, see Additional help.
Deletions of namespaces stuck at "Terminating"
Deletions of namespaces might get stuck at Terminating if you have Config Connector installed in namespaced-mode and if the namespace's ConfigConnectorContext was deleted before all Config Connector resources in that namespace were deleted. When a namespace's ConfigConnectorContext is deleted, Config Connector is disabled for that namespace, which prevents any remaining Config Connector resources in that namespace from being deleted.
To fix this issue, you must do a forced cleanup and then manually delete the underlying Google Cloud resources afterwards.
To mitigate this issue in the future, only delete the ConfigConnectorContext after all Config Connector resources in its namespace have been deleted from Kubernetes. Avoid deleting entire namespaces before all Config Connector resources in that namespace have been deleted, since the ConfigConnectorContext might get deleted first.
Also see how deleting a namespace containing a Project and its children or a Folder and its children can get stuck.
Deletions of resources stuck at "DeleteFailed" after project was deleted
Deletions of Config Connector resources might get stuck at DeleteFailed if their Google Cloud project was deleted beforehand.
To fix this issue, restore the project on Google Cloud to allow Config Connector to delete remaining child resources from Kubernetes. Alternatively, you can do a forced cleanup.
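For example, you can restore a recently deleted project with the Google Cloud CLI (this only works within the project deletion retention window, about 30 days):

gcloud projects undelete PROJECT_ID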
To mitigate this issue in the future, only delete Google Cloud projects after all their child Config Connector resources have been deleted from Kubernetes. Avoid deleting entire namespaces that might contain both a Project resource and its child Config Connector resources, since the Project resource might get deleted first.
Compute Engine Metadata not defined
If your Config Connector resource has an UpdateFailed status with a message stating that the Compute Engine metadata is not defined, then that likely means that the IAM service account used by Config Connector does not exist.
Example UpdateFailed message:
Update call failed: error fetching live state: error reading underlying resource: summary: Error when reading or editing SpannerInstance "my-project/my-spanner-instance": Get "https://spanner.googleapis.com/v1/projects/my-project/instances/my-spanner-instance?alt=json": metadata: Compute Engine metadata "instance/service-accounts/default/token?scopes=https%!A(MISSING)%!F(MISSING)%!F(MISSING)www.googleapis.com%!F(MISSING)auth%!F(MISSING)compute%!C(MISSING)https%!A(MISSING)%!F(MISSING)%!F(MISSING)www.googleapis.com%!F(MISSING)auth%!F(MISSING)cloud-platform%!C(MISSING)https%!A(MISSING)%!F(MISSING)%!F(MISSING)www.googleapis.com%!F(MISSING)auth%!F(MISSING)cloud-identity%!C(MISSING)https%!A(MISSING)%!F(MISSING)%!F(MISSING)www.googleapis.com%!F(MISSING)auth%!F(MISSING)ndev.clouddns.readwrite%!C(MISSING)https%!A(MISSING)%!F(MISSING)%!F(MISSING)www.googleapis.com%!F(MISSING)auth%!F(MISSING)devstorage.full_control%!C(MISSING)https%!A(MISSING)%!F(MISSING)%!F(MISSING)www.googleapis.com%!F(MISSING)auth%!F(MISSING)userinfo.email%!C(MISSING)https%!A(MISSING)%!F(MISSING)%!F(MISSING)www.googleapis.com%!F(MISSING)auth%!F(MISSING)drive.readonly" not defined, detail:
To fix the issue, ensure that the IAM service account used by Config Connector exists.
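For example, you can check whether the service account still exists with the Google Cloud CLI (the email is a placeholder for the service account you configured during installation):

gcloud iam service-accounts describe SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com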
To mitigate this issue in the future, ensure that you follow the Config Connector installation instructions.
Error 403: Request had insufficient authentication scopes
If your Config Connector resource has an UpdateFailed status with a message indicating a 403 error due to insufficient authentication scopes, then that likely means that Workload Identity Federation for GKE is not enabled on your GKE cluster.
Example UpdateFailed message:
Update call failed: error fetching live state: error reading underlying resource: summary: Error when reading or editing SpannerInstance "my-project/my-spanner-instance": googleapi: Error 403: Request had insufficient authentication scopes.
To investigate, complete the following steps:
Save the following Pod configuration as wi-test.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: workload-identity-test
  namespace: cnrm-system
spec:
  containers:
  - image: google/cloud-sdk:slim
    name: workload-identity-test
    command: ["sleep","infinity"]
  serviceAccountName: cnrm-controller-manager
If you installed Config Connector using namespaced mode, serviceAccountName should be cnrm-controller-manager-NAMESPACE. Replace NAMESPACE with the namespace you used during the installation.
Create the Pod in your GKE cluster:
kubectl apply -f wi-test.yaml
Open an interactive session in the Pod:
kubectl exec -it workload-identity-test \
    --namespace cnrm-system \
    -- /bin/bash
List your identity:
gcloud auth list
Verify that the identity listed matches the Google service account bound to your resources.
If you see the Compute Engine default service account instead, then that means that Workload Identity Federation for GKE is not enabled on your GKE cluster and/or node pool.
Exit the interactive session, then delete the Pod from your GKE cluster:
kubectl delete pod workload-identity-test \
    --namespace cnrm-system
To fix this issue, use a GKE cluster with Workload Identity Federation for GKE enabled.
If you're still seeing the same error on a GKE cluster with Workload Identity Federation for GKE enabled, ensure that you did not forget to also enable Workload Identity Federation for GKE on the cluster's node pools. Read more about enabling Workload Identity Federation for GKE on existing node pools. We recommend enabling Workload Identity Federation for GKE on all your cluster's node pools since Config Connector could run on any of them.
403 Forbidden: The caller does not have permission; refer to the Workload Identity Federation for GKE documentation
If your Config Connector resource has an UpdateFailed status with a message indicating a 403 error due to Workload Identity Federation for GKE, then that likely means that Config Connector's Kubernetes service account is missing the appropriate IAM permissions to impersonate your IAM service account as a Workload Identity Federation for GKE user.
Example UpdateFailed message:
Update call failed: error fetching live state: error reading underlying resource: summary: Error when reading or editing SpannerInstance "my-project/my-spanner-instance": Get "https://spanner.googleapis.com/v1/projects/my-project/instances/my-spanner-instance?alt=json": compute: Received 403 `Unable to generate access token; IAM returned 403 Forbidden: The caller does not have permission This error could be caused by a missing IAM policy binding on the target IAM service account. For more information, refer to the Workload Identity Federation for GKE documentation: https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#creating_a_relationship_between_ksas_and_gsas
To fix and mitigate the issue in the future, refer to the Config Connector installation instructions.
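As a sketch, the binding from the installation instructions typically looks like the following, assuming a cluster-mode installation that uses the cnrm-controller-manager Kubernetes service account in the cnrm-system namespace (for namespaced mode, the member references cnrm-controller-manager-NAMESPACE instead):

gcloud iam service-accounts add-iam-policy-binding \
    SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com \
    --member="serviceAccount:PROJECT_ID.svc.id.goog[cnrm-system/cnrm-controller-manager]" \
    --role="roles/iam.workloadIdentityUser"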
Error 403: Caller is missing IAM permission
If your Config Connector resource has an UpdateFailed status with a message stating that the caller is missing an IAM permission, that likely means that the IAM service account used by Config Connector is missing the IAM permission stated in the message that is needed to manage the Google Cloud resource.
Example UpdateFailed message:
Update call failed: error fetching live state: error reading underlying resource: summary: Error when reading or editing SpannerInstance "my-project/my-spanner-instance": googleapi: Error 403: Caller is missing IAM permission spanner.instances.get on resource projects/my-project/instances/my-spanner-instance., detail:
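To fix the issue, grant the IAM service account a role that includes the missing permission. A minimal sketch (the role name is a placeholder; pick one that contains the permission named in the error, such as spanner.instances.get in this example):

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com" \
    --role="ROLE_CONTAINING_MISSING_PERMISSION"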
If you're still seeing the same error after granting your IAM service account the appropriate IAM permissions, then check that your resource is being created in the correct project. Check the Config Connector resource's spec.projectRef field (or its cnrm.cloud.google.com/project-id annotation if the resource doesn't support a spec.projectRef field) and verify that the resource is referencing the correct project. Note that Config Connector uses the namespace's name as the project ID if neither the resource nor the namespace specifies a target project. Read more about how to configure the target project for project-scoped resources.
If you're still seeing the same error, then check if Workload Identity Federation for GKE is enabled on your GKE cluster.
To mitigate this issue in the future, ensure that you follow the Config Connector installation instructions.
Version not supported in Config Connector add-on installations
If you can't enable the Config Connector add-on successfully, the following error message appears: Node version 1.15.x-gke.x is unsupported. To resolve this error, verify that the version of the GKE cluster meets the version and feature requirements.
To get all valid versions for your clusters, run the following command:
gcloud container get-server-config --format "yaml(validMasterVersions)" \
--zone ZONE
Replace ZONE with the Compute Engine zone.
Pick a version from the list that meets the requirements.
The error message also appears if Workload Identity Federation for GKE or GKE Monitoring is disabled. Ensure that these features are enabled to fix the error.
Cannot make changes to immutable fields
Config Connector rejects updates to immutable fields at admission. For example, updating an immutable field with kubectl apply causes the command to fail immediately.
This means that tools which continuously re-apply resources (for example, GitOps) might find themselves getting stuck while updating a resource if they don't handle admission errors.
Since Config Connector does not allow updates to immutable fields, the only way to perform such an update is to delete and re-create the resource.
Error updating the immutable fields when there is no update
You might see the following errors in the status of the Config Connector resource shortly after you create or acquire a Google Cloud resource using Config Connector:
- Update call failed: error applying desired state: infeasible update: ({true <nil>}) would require recreation (example)
- Update call failed: cannot make changes to immutable field(s) (example)
These errors don't necessarily mean that you've updated the resource. The cause might be that the Google Cloud API changed an immutable field that is managed by you in the Config Connector resource, causing a mismatch between the desired state and the live state of the immutable fields.
You can resolve the issue by updating the values of those immutable fields in the Config Connector resource to match the live state. To do so, complete the following steps:
- Update the YAML configuration of the Config Connector resource and set the cnrm.cloud.google.com/deletion-policy annotation to abandon.
- Apply the updated YAML configuration to update the Config Connector resource's deletion policy.
- Abandon the Config Connector resource.
- Print out the live state of the corresponding Google Cloud resource using the gcloud CLI (see the example after this list).
- Find the mismatch between the gcloud CLI output and the YAML configuration of the Config Connector resource, and update those fields in the YAML configuration.
- Apply the updated YAML configuration to acquire the abandoned resource.
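For example, for the step that prints the live state, a minimal sketch assuming a hypothetical SpannerInstance named my-spanner-instance in project my-project; compare the output field by field with your manifest:

gcloud spanner instances describe my-spanner-instance \
    --project my-project --format yaml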
Resource has no status
If your resources don't have a status field, then it is likely that Config Connector is not running properly. Check that Config Connector is running.
No matches for kind "Foo"
When this error is encountered, it means that your Kubernetes cluster does not have the CRD for the Foo resource kind installed.
Verify that the kind is a resource kind supported by Config Connector.
If the kind is supported, then that means your Config Connector installation is either out-of-date or invalid.
If you installed Config Connector using the GKE add-on, then your installation should be upgraded automatically. If you manually installed Config Connector, then you must perform a manual upgrade.
Check the GitHub repository to determine which resource kinds are supported by which Config Connector versions (for example, here are the kinds supported by Config Connector v1.44.0).
Labels are not propagated to the Google Cloud resource
Config Connector propagates labels found in metadata.labels to the underlying Google Cloud resource. However, note that not all Google Cloud resources support labels. Check the resource's REST API documentation (for example, here is the API documentation for PubSubTopic) to see if it supports labels.
Failed calling webhook x509: certificate relies on legacy Common Name field
If you see an error similar to the following example, you might be experiencing an issue with certificates:
Error from server (InternalError): error when creating "/mnt/set-weaver-dns-record.yml": Internal error occurred: failed calling webhook "annotation-defaulter.cnrm.cloud.google.com": Post "https://cnrm-validating-webhook.cnrm-system.svc:443/annotation-defaulter?timeout=30s": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0
To work around this issue, delete the relevant certificates and the Pods:
kubectl delete -n cnrm-system secrets cnrm-webhook-cert-abandon-on-uninstall
kubectl delete -n cnrm-system secrets cnrm-webhook-cert-cnrm-validating-webhook
kubectl delete -n cnrm-system pods -l "cnrm.cloud.google.com/component=cnrm-webhook-manager"
After you have deleted these resources, the correct certificate regenerates.
For more information about this error, see the GitHub issue.
Error due to special characters in resource name
Special characters are not valid in the Kubernetes metadata.name field. If you see an error similar to the following example, then the resource's metadata.name likely has a value with special characters:
a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
For example, the following SQLUser resource contains an invalid character in metadata.name:
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLUser
metadata:
name: test.example@example-project.iam
spec:
instanceRef:
name: test-cloudsql-db
type: "CLOUD_IAM_USER"
If you try to create this resource, you get the following error:
Error from server (Invalid): error when creating "sqlusercrd.yaml": SQLUser.sql.cnrm.cloud.google.com "test.example@example-project.iam" is invalid: metadata.name: Invalid value: "test.example@example-project.iam": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
If you'd like to give your resource a name that is not a valid Kubernetes name, but is a valid Google Cloud resource name, you can use the resourceID field, as shown in the following example:
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLUser
metadata:
name: 'test'
spec:
instanceRef:
name: sqlinstance-sample-postgresql
host: "%"
type: CLOUD_IAM_USER
resourceID: test.example@example-project.iam
This configuration causes Config Connector to use resourceID instead of metadata.name as the name of the resource.
Unable to remove fields from resource spec
Removing a field from a Config Connector resource's spec (by updating the resource's .yaml file and re-applying it, or by using kubectl edit to edit the resource spec) does not actually remove that field from either the Config Connector resource's spec or the underlying Google Cloud resource. Instead, removing a field from the spec just makes that field externally-managed.
If you want to change the value of a field to empty or default in the underlying Google Cloud resource, you'll have to zero out the field in the Config Connector resource's spec:
For a list type field, set the field to an empty list by using []. The following example shows the targetServiceAccounts field that we want to remove:
spec:
  targetServiceAccounts:
  - external: "foo-bar@foo-project.iam.gserviceaccount.com"
  - external: "bar@foo-project.iam.gserviceaccount.com"
To remove this field, set the value to empty:
spec:
  targetServiceAccounts: []
For a primitive type field, set the field to empty by using one of the following values:

Type      Empty value
string    ""
bool      "false"
integer   0

The following example shows the identityNamespace field that we want to remove:
spec:
  workloadIdentityConfig:
    identityNamespace: "foo-project.svc.id.goog"
To remove this field, set the value to empty:
spec:
  workloadIdentityConfig:
    identityNamespace: ""
For object type fields, Config Connector currently provides no easy way to set a whole object type field to "NULL". You can try setting the subfields of the object type to empty or default values by following the guidance above and verify whether it works.
KNV2005: syncer excessively updating resource
If you are using Config Sync and you are seeing KNV2005 errors for Config Connector resources, then it is likely that Config Sync and Config Connector are fighting over the resource.
Example log message:
KNV2005: detected excessive object updates, approximately 6 times per minute. This may indicate Config Sync is fighting with another controller over the object.
Config Sync and Config Connector are said to be "fighting" over a resource if they keep updating the same field(s) to different values. One's update triggers the other to act and update the resource, which causes the other to act and update the resource, and so on.
Fighting is not a problem for most fields. Fields that are specified in Config Sync are not changed by Config Connector, while fields that are not specified in Config Sync and defaulted by Config Connector are ignored by Config Sync. Therefore, for most fields, Config Sync and Config Connector should never end up updating the same field to different values.
There is one exception: list fields. Similar to how Config Connector may default subfields in object fields, Config Connector may also default subfields in objects inside lists. However, since list fields in Config Connector resources are atomic, defaulting a subfield is treated as changing the value of the entire list.
Therefore, Config Sync and Config Connector will end up fighting if Config Sync sets a list field and Config Connector defaults any subfields within that list.
To work around this issue, you have the following options:
Update the resource manifest in the Config Sync repository to match what Config Connector is trying to set the resource to.
One way to do this is to temporarily stop syncing configs, wait for Config Connector to finish reconciling the resource, and then update your resource manifest to match the resource on the Kubernetes API Server.
Stop Config Sync from reacting to updates to the resource on the Kubernetes API Server by setting the annotation client.lifecycle.config.k8s.io/mutation to ignore (see the example after this list). Read more about how to have Config Sync ignore object mutations.
Stop Config Connector from updating the resource's spec entirely by setting the annotation cnrm.cloud.google.com/state-into-spec to absent on the resource. This annotation is not supported for all resources. To see if your resource supports the annotation, check the corresponding resource reference page. Read more about the annotation.
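For example, a minimal sketch of the annotations from the last two options on a hypothetical PubSubTopic (apply whichever option fits your situation, not necessarily both):

apiVersion: pubsub.cnrm.cloud.google.com/v1beta1
kind: PubSubTopic
metadata:
  name: example-topic
  annotations:
    # Option: have Config Sync ignore mutations to this object.
    client.lifecycle.config.k8s.io/mutation: ignore
    # Option: stop Config Connector from writing observed state into spec
    # (only if the resource kind supports this annotation).
    cnrm.cloud.google.com/state-into-spec: absent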
failed calling webhook
It's possible for Config Connector to be in a state where you cannot uninstall it. This commonly happens when you use the Config Connector add-on and disable Config Connector before removing the Config Connector CRDs. When you try to uninstall, you receive an error similar to the following:
error during reconciliation: error building deployment objects: error finalizing the deletion of Config Connector system components deployed by ConfigConnector controller: error waiting for CRDs to be deleted: error deleting CRD accesscontextmanageraccesslevels.accesscontextmanager.cnrm.cloud.google.com: Internal error occurred: failed calling webhook "abandon-on-uninstall.cnrm.cloud.google.com": failed to call webhook: Post "https://abandon-on-uninstall.cnrm-system.svc:443/abandon-on-uninstall?timeout=3s": service "abandon-on-uninstall" not found
To resolve this error, you must first manually delete the webhooks:
kubectl delete validatingwebhookconfiguration abandon-on-uninstall.cnrm.cloud.google.com --ignore-not-found --wait=true
kubectl delete validatingwebhookconfiguration validating-webhook.cnrm.cloud.google.com --ignore-not-found --wait=true
kubectl delete mutatingwebhookconfiguration mutating-webhook.cnrm.cloud.google.com --ignore-not-found --wait=true
You can then proceed to uninstall Config Connector.
Update error with IAMPolicy, IAMPartialPolicy and IAMPolicyMember
If you delete an IAMServiceAccount Config Connector resource before cleaning up the IAMPolicy, IAMPartialPolicy, and IAMPolicyMember resources that depend on that service account, Config Connector cannot locate the service account referenced in those IAM resources during reconciliation. This results in an UpdateFailed status with an error message like the following:
Update call failed: error setting policy member: error applying changes: summary: Request `Create IAM Members roles/[MYROLE] serviceAccount:[NAME]@[PROJECT_ID].iam.gserviceaccount.com for project \"projects/[PROJECT_ID]\"` returned error: Error applying IAM policy for project \"projects/[PROJECT_ID]\": Error setting IAM policy for project \"projects/[PROJECT_ID]\": googleapi: Error 400: Service account [NAME]@[PROJECT_ID].iam.gserviceaccount.com does not exist., badRequest
To resolve this issue, check your service accounts and see if the required service account for those IAM resources is deleted.
If the service account is deleted, clean up the related IAM Config Connector resources, too. For IAMPolicyMember, delete the whole resource. For IAMPolicy and IAMPartialPolicy, only remove the bindings that involve the deleted service account.
However, this cleanup doesn't remove the Google Cloud role bindings immediately. The role bindings are retained for 60 days because of the retention period on the deleted service account.
For more information, see the Google Cloud IAM documentation about Delete a service account.
To avoid this issue, you should always clean up IAMPolicy, IAMPartialPolicy, and IAMPolicyMember Config Connector resources before deleting the referenced IAMServiceAccount.
Resource deleted by Config Connector
Config Connector never deletes your resources without an external cause. For example, running kubectl delete, using config management tools like Argo CD, or using a customized API client can cause resource deletion.
A common misconception is that Config Connector initiated the deletion of some of the resources in your cluster. For example, if you are using Config Connector, you might notice delete requests from the Config Connector controller manager against certain resources, either in container log messages or in Kubernetes cluster audit logs. These delete requests are the result of external triggers, and Config Connector is reconciling those delete requests.
To determine why a resource was deleted, you need to look into the first delete request that was sent to the corresponding resource. The best way to look into this is by examining the Kubernetes cluster audit logs.
If you are using GKE, you can use Cloud Logging to query for GKE cluster audit logs. For example, to look for the initial delete requests for a BigQueryDataset resource named foo in namespace bar, you would run a query like the following:
resource.type="k8s_cluster"
resource.labels.project_id="my-project-id"
resource.labels.cluster_name="my-cluster-name"
protoPayload.methodName="com.google.cloud.cnrm.bigquery.v1beta1.bigquerydatasets.delete"
protoPayload.resourceName="bigquery.cnrm.cloud.google.com/v1beta1/namespaces/bar/bigquerydatasets/foo"
Using this query, you would look for the first delete request and then check authenticationInfo.principalEmail of that delete log message to determine the cause of the deletion.
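A minimal sketch of running the same query from the command line with the Google Cloud CLI (the project, cluster, and resource names are the hypothetical values from the query above):

gcloud logging read '
  resource.type="k8s_cluster"
  resource.labels.project_id="my-project-id"
  resource.labels.cluster_name="my-cluster-name"
  protoPayload.methodName="com.google.cloud.cnrm.bigquery.v1beta1.bigquerydatasets.delete"
  protoPayload.resourceName="bigquery.cnrm.cloud.google.com/v1beta1/namespaces/bar/bigquerydatasets/foo"
' --project my-project-id --order asc --limit 10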
Controller Pod OOMKilled
If you see an OOMKilled error on a Config Connector controller Pod, it indicates that a container or the entire Pod was terminated because it used more memory than allowed. You can verify this by running the kubectl describe command; the Pod's status may appear as OOMKilled or Terminating. Additionally, scrutinizing the Pod's event logs can reveal any occurrences of OOM-related events.
kubectl describe pod POD_NAME -n cnrm-system
Replace POD_NAME with the Pod you are troubleshooting.
To address this issue, you can use the ControllerResource custom resource to increase the memory request and the memory limit for the Pod.
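A minimal sketch, assuming a cluster-mode installation; verify the exact apiVersion and schema against the ControllerResource reference for your Config Connector version before applying, and adjust the memory values to your workload:

apiVersion: customize.core.cnrm.cloud.google.com/v1beta1
kind: ControllerResource
metadata:
  # Targets the cnrm-controller-manager controller.
  name: cnrm-controller-manager
spec:
  containers:
  - name: manager
    resources:
      requests:
        memory: 512Mi
      limits:
        memory: 512Mi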
PodSecurityPolicy prevents upgrades
After switching from the Config Connector add-on to a manual install and upgrading Config Connector to a new version, the use of PodSecurityPolicies can prevent cnrm Pods from updating.
To confirm that the PodSecurityPolicies are preventing your upgrade, check the config-connector-operator's events and look for an error similar to the following:
create Pod configconnector-operator-0 in StatefulSet configconnector-operator failed error: pods "configconnector-operator-0" is forbidden: PodSecurityPolicy: unable to admit pod: [pod.metadata.annotations[seccomp.security.alpha.kubernetes.io/pod]: Forbidden: seccomp may not be set pod.metadata.annotations[container.seccomp.security.alpha.kubernetes.io/manager]: Forbidden: seccomp may not be set]
To resolve this issue, you must specify the annotation on the PodSecurityPolicy that corresponds to the annotation mentioned in the error. In the previous example, the annotation is seccomp.security.alpha.kubernetes.io.
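For example, one way to add the annotation is with kubectl annotate on the PodSecurityPolicy that governs the operator Pods (the policy name is a placeholder, and '*' is one permissive choice of allowed seccomp profile names; you can list a specific profile instead):

kubectl annotate podsecuritypolicy PSP_NAME \
    seccomp.security.alpha.kubernetes.io/allowedProfileNames='*'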
Forced cleanup
If your Config Connector resources are stuck on deletion and you simply want to get rid of them from your Kubernetes cluster, you can force their deletion by deleting their finalizers.
You can delete a resource's finalizers by editing the resource using kubectl edit, deleting the metadata.finalizers field, and then saving the file to persist your changes to the Kubernetes API Server.
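Alternatively, a non-interactive sketch that clears the finalizers with kubectl patch (kind, name, and namespace are placeholders):

kubectl patch RESOURCE_KIND RESOURCE_NAME -n NAMESPACE \
    --type merge -p '{"metadata":{"finalizers":null}}'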
Since deleting a resource's finalizers allows the resource to be immediately deleted from the Kubernetes cluster, Config Connector might not get a chance to complete the deletion of the underlying Google Cloud resource. This means that you might need to manually delete your Google Cloud resources afterwards.
Monitoring
Metrics
You can use Prometheus to collect and show metrics from Config Connector.
Logging
All Config Connector Pods output structured logs in JSON format.
The logs of the controller Pods are particularly useful for debugging issues with the reconciliation of resources.
You can query for logs for specific resources by filtering for the following fields in the log messages (see the example after this list):
- logger: contains the resource's kind in lower-case. For example, PubSubTopic resources have a logger of pubsubtopic-controller.
- resource.namespace: contains the resource's namespace.
- resource.name: contains the resource's name.
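For example, a minimal sketch that filters the controller logs locally for a hypothetical PubSubTopic named example-topic in namespace example-ns (assumes jq is installed; lines that are not valid JSON are skipped):

kubectl logs -n cnrm-system \
    -l cnrm.cloud.google.com/component=cnrm-controller-manager \
    -c manager \
  | jq -R -c 'fromjson? | select(.logger == "pubsubtopic-controller"
      and .resource.namespace == "example-ns"
      and .resource.name == "example-topic")'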
Using Cloud Logging for advanced log querying
If you are using GKE, you can use Cloud Logging to query for logs for a specific resource with the following query:
# Filter to include only logs coming from the controller Pods
resource.type="k8s_container"
resource.labels.container_name="manager"
resource.labels.namespace_name="cnrm-system"
labels.k8s-pod/cnrm_cloud_google_com/component="cnrm-controller-manager"
# Filter to include only logs coming from a particular GKE cluster
resource.labels.cluster_name="GKE_CLUSTER_NAME"
resource.labels.location="GKE_CLUSTER_LOCATION"
# Filter to include only logs for a particular Config Connector resource
jsonPayload.logger="RESOURCE_KIND-controller"
jsonPayload.resource.namespace="RESOURCE_NAMESPACE"
jsonPayload.resource.name="RESOURCE_NAME"
Replace the following:
- GKE_CLUSTER_NAME with the name of the GKE cluster running Config Connector.
- GKE_CLUSTER_LOCATION with the location of the GKE cluster running Config Connector. For example, us-central1.
- RESOURCE_KIND with the resource's kind in lower-case. For example, pubsubtopic.
- RESOURCE_NAMESPACE with the resource's namespace.
- RESOURCE_NAME with the resource's name.
Additional help
To get additional help, you can file an issue on GitHub or contact Google Cloud Support.