Troubleshoot Config Connector


This page describes troubleshooting techniques for Config Connector and common issues that you might encounter when using the product.

Basic troubleshooting techniques

Check the version of Config Connector

Run the following command to get the installed Config Connector version, and cross-reference the release notes to verify that the running version supports the features and resources that you want to use:

kubectl get ns cnrm-system -o jsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com/version}'

Check the resource's status and events

Usually, you can determine the issue with your Config Connector resources by inspecting the state of your resources in Kubernetes. Checking a resource's status and events is particularly helpful for determining if Config Connector failed to reconcile the resource and why the reconciliation failed.
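For example, you can view a resource's status conditions and events by describing it:

kubectl describe RESOURCE_KIND RESOURCE_NAME -n NAMESPACE

Replace RESOURCE_KIND with the resource's kind in lower-case (for example, pubsubtopic), RESOURCE_NAME with the resource's name, and NAMESPACE with the resource's namespace.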

Check that Config Connector is running

To check that Config Connector is running, verify that all of its Pods are READY:

kubectl get pod -n cnrm-system

Example output:

NAME                                            READY   STATUS    RESTARTS   AGE
cnrm-controller-manager-0                       1/1     Running   0          1h
cnrm-deletiondefender-0                         1/1     Running   0          1h
cnrm-resource-stats-recorder-77dc8cc4b6-mgpgp   1/1     Running   0          1h
cnrm-webhook-manager-58496b66f9-pqwhz           1/1     Running   0          1h
cnrm-webhook-manager-58496b66f9-wdcn4           1/1     Running   0          1h

If you have Config Connector installed in namespaced-mode, then you have one controller Pod (cnrm-controller-manager) for each namespace. Each controller Pod is responsible for managing the Config Connector resources in its namespace.

You can check the status of the controller Pod responsible for a specific namespace by running:

kubectl get pod -n cnrm-system \
    -l cnrm.cloud.google.com/scoped-namespace=NAMESPACE \
    -l cnrm.cloud.google.com/component=cnrm-controller-manager

Replace NAMESPACE with the name of the namespace.

Check the controller logs

The controller Pod logs information and errors related to the reconciliation of Config Connector resources.

You can check the controller Pod's logs by running:

kubectl logs -n cnrm-system \
    -l cnrm.cloud.google.com/component=cnrm-controller-manager \
    -c manager

If you have Config Connector installed in namespaced-mode, then the previous command shows the logs of all controller Pods combined. You can check the logs of the controller Pod for a specific namespace by running:

kubectl logs -n cnrm-system \
    -l cnrm.cloud.google.com/scoped-namespace=NAMESPACE \
    -l cnrm.cloud.google.com/component=cnrm-controller-manager \
    -c manager

Replace NAMESPACE with the name of the namespace.

Read more about how to inspect and query Config Connector's logs.

Common issues

Resource keeps updating every 5 to 15 minutes

If your Config Connector resource keeps switching from an UpToDate status to an Updating status every 5 to 15 minutes, then Config Connector is likely detecting unintentional diffs between the resource's desired state and actual state, which causes Config Connector to constantly update the resource.

First, confirm that you do not have any external systems that are constantly modifying either the Config Connector resource or the Google Cloud resource (for example, CI/CD pipelines, custom controllers, or cron jobs).

If the behavior is not due to an external system, see if Google Cloud is changing any of the values specified in your Config Connector resource. For example, in some cases, Google Cloud changes the formatting (for example, capitalization) of field values which leads to a diff between your resource's desired state and actual state.

Get the state of the Google Cloud resource using the REST API (for example, for ContainerCluster) or the Google Cloud CLI. Then, compare that state against your Config Connector resource. Look for any fields whose values do not match, then update your Config Connector resource to match. In particular, look for any values that were reformatted by Google Cloud. For example, see GitHub issues #578 and #294.
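For example, a minimal sketch of fetching the live state of a hypothetical ContainerCluster named my-cluster with the Google Cloud CLI (the cluster name and zone are placeholders):

gcloud container clusters describe my-cluster \
    --zone us-central1-a \
    --format yaml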

Note that this is not a perfect method since the Config Connector and Google Cloud resource models are different, but it should let you catch most cases of unintended diffs.

If you are unable to resolve your issue, see Additional help.

Deletions of namespaces stuck at "Terminating"

Deletions of namespaces might get stuck at Terminating if you have Config Connector installed in namespaced-mode and if the namespace's ConfigConnectorContext was deleted before all Config Connector resources in that namespace were deleted. When a namespace's ConfigConnectorContext is deleted, Config Connector is disabled for that namespace, which prevents any remaining Config Connector resources in that namespace from being deleted.

To fix this issue, you must do a forced cleanup and then manually delete the underlying Google Cloud resources.

To mitigate this issue in the future, only delete the ConfigConnectorContext after all Config Connector resources in its namespace have been deleted from Kubernetes. Avoid deleting entire namespaces before all Config Connector resources in that namespace have been deleted since the ConfigConnectorContext might get deleted first.

Also see how deleting a namespace containing a Project and its children or a Folder and its children can get stuck.

Deletions of resources stuck at "DeleteFailed" after project was deleted

Deletions of Config Connector resources might get stuck at DeleteFailed if their Google Cloud project had been deleted beforehand.

To fix this issue, restore the project on Google Cloud to allow Config Connector to delete remaining child resources from Kubernetes. Alternatively, you can do a forced cleanup.
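For example, assuming the project's ID is PROJECT_ID and the project is still within its restoration window, you can attempt to restore it with:

gcloud projects undelete PROJECT_ID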

To mitigate this issue in the future, only delete Google Cloud projects after all their child Config Connector resources have been deleted from Kubernetes. Avoid deleting entire namespaces that might contain both a Project resource and its child Config Connector resources since the Project resource might get deleted first.

Compute Engine Metadata not defined

If your Config Connector resource has an UpdateFailed status with a message stating that the Compute Engine metadata is not defined, then that likely means that the IAM service account used by Config Connector does not exist.

Example UpdateFailed message:

Update call failed: error fetching live state: error reading underlying
resource: summary: Error when reading or editing SpannerInstance
"my-project/my-spanner- instance": Get
"https://spanner.googleapis.com/v1/projects/my-project/instances/my-spanner-instance?alt=json":
metadata: Compute Engine metadata "instance/service-accounts/default/token?
scopes=https%!A(MISSING)%!F(MISSING)%!F(MISSING)www.googleapis.com%!F(MISSING)auth%!F(MISSING)compute%!C(MISSING)https%!A(MISSING)%!F(MISSING)%!F(MISSING)www.googleapis.com%!F(MISSIN
G)auth%!F(MISSING)cloud-platform%!C(MISSING)https%!A(MISSING)%!F(MISSING)%!F(MISSING)www.googleapis.com%!F(MISSING)auth%!F(MISSING)cloud-identity%!C(MISSING)https%!A(MISSING)%!F(MISS
ING)%!F(MISSING)www.googleapis.com%!F(MISSING)auth%!F(MISSING)ndev.clouddns.readwrite%!C(MISSING)https%!A(MISSING)%!F(MISSING)%!F(MISSING)www.googleapis.com%!F(MISSING)auth%!F(MISSIN
G)devstorage.full_control%!C(MISSING)https%!A(MISSING)%!F(MISSING)%!F(MISSING)www.googleapis.com%!F(MISSING)auth%!F(MISSING)userinfo.email%!C(MISSING)https%!A(MISSING)%!F(MISSING)%!F
(MISSING)www.googleapis.com%!F(MISSING)auth%!F(MISSING)drive.readonly" not
defined, detail:

To fix the issue, ensure that the IAM service account used by Config Connector exists.
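For example, you can check whether the service account still exists with the following command, replacing SERVICE_ACCOUNT_EMAIL with the email of the IAM service account that your Config Connector installation is configured to use:

gcloud iam service-accounts describe SERVICE_ACCOUNT_EMAIL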

To mitigate this issue in the future, ensure that you follow the Config Connector installation instructions.

Error 403: Request had insufficient authentication scopes

If your Config Connector resource has an UpdateFailed status with a message indicating a 403 error due to insufficient authentication scopes, then that likely means that Workload Identity is not enabled on your GKE cluster.

Example UpdateFailed message:

Update call failed: error fetching live state: error reading underlying
resource: summary: Error when reading or editing SpannerInstance
"my-project/my-spanner-instance": googleapi: Error 403: Request had
insufficient authentication scopes.

To investigate, complete the following steps:

  1. Save the following Pod configuration as wi-test.yaml:

    apiVersion: v1
    kind: Pod
    metadata:
      name: workload-identity-test
      namespace: cnrm-system
    spec:
      containers:
      - image: google/cloud-sdk:slim
        name: workload-identity-test
        command: ["sleep","infinity"]
      serviceAccountName: cnrm-controller-manager
    

    If you installed Config Connector using namespaced mode, serviceAccountName should be cnrm-controller-manager-NAMESPACE. Replace NAMESPACE with the namespace you used during the installation.

  2. Create the Pod in your GKE cluster:

    kubectl apply -f wi-test.yaml
    
  3. Open an interactive session in the Pod:

    kubectl exec -it workload-identity-test \
        --namespace cnrm-system \
        -- /bin/bash
    
  4. List your identity:

    gcloud auth list
    
  5. Verify that the identity listed matches the Google service account bound to your resources.

    If you see the Compute Engine default service account instead, then that means that Workload Identity is not enabled on your GKE cluster and/or node pool.

  6. Exit the interactive session, then delete the Pod from your GKE cluster:

    kubectl delete pod workload-identity-test \
        --namespace cnrm-system
    

To fix this issue, use a GKE cluster with Workload Identity enabled.

If you're still seeing the same error on a GKE cluster with Workload Identity enabled, ensure that you did not forget to also enable Workload Identity on the cluster's node pools. Read more about enabling Workload Identity on existing node pools. We recommend enabling Workload Identity on all your cluster's node pools since Config Connector could run on any of them.

403 Forbidden: The caller does not have permission; refer to the Workload Identity documentation

If your Config Connector resource has an UpdateFailed status with a message indicating a 403 error due to Workload Identity, then that likely means that Config Connector's Kubernetes service account is missing the appropriate IAM permissions to impersonate your IAM service account as a Workload Identity user.

Example UpdateFailed message:

Update call failed: error fetching live state: error reading underlying
resource: summary: Error when reading or editing SpannerInstance
"my-project/my-spanner- instance": Get
"https://spanner.googleapis.com/v1/projects/my-project/instances/my-spanner-instance?alt=json":
compute: Received 403 `Unable to generate access token; IAM returned 403
Forbidden: The caller does not have permission
This error could be caused by a missing IAM policy binding on the target IAM
service account.
For more information, refer to the Workload Identity documentation:
  https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#creating_a_relationship_between_ksas_and_gsas

To fix and mitigate the issue in the future, refer to the Config Connector installation instructions.
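For reference, the binding created by the installation instructions typically looks similar to the following sketch. The Kubernetes service account shown assumes a cluster-mode installation (for namespaced-mode, it is cnrm-controller-manager-NAMESPACE); replace SERVICE_ACCOUNT_EMAIL and PROJECT_ID with your values:

gcloud iam service-accounts add-iam-policy-binding \
    SERVICE_ACCOUNT_EMAIL \
    --member="serviceAccount:PROJECT_ID.svc.id.goog[cnrm-system/cnrm-controller-manager]" \
    --role="roles/iam.workloadIdentityUser"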

Error 403: Caller is missing IAM permission

If your Config Connector resource has an UpdateFailed status with a message stating that the caller is missing an IAM permission, that likely means that the IAM service account used by Config Connector is missing the IAM permission stated in the message that is needed to manage the Google Cloud resource.

Example UpdateFailed message:

Update call failed: error fetching live state: error reading underlying
resource: summary: Error when reading or editing SpannerInstance
"my-project/my-spanner- instance": googleapi: Error 403: Caller is missing IAM
permission spanner.instances.get on resource
projects/my-project/instances/my-spanner-instance., detail:
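To fix the issue, grant the missing permission to the IAM service account used by Config Connector, typically by granting a role that contains it. The following sketch grants an illustrative Spanner role at the project level; choose the role that is appropriate for your resource and replace the placeholders with your values:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/spanner.admin"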

If you're still seeing the same error after granting your IAM service account the appropriate IAM permissions, then check that your resource is being created in the correct project. Check the Config Connector resource's spec.projectRef field (or its cnrm.cloud.google.com/project-id annotation if the resource doesn't support a spec.projectRef field) and verify that the resource is referencing the correct project. Note that Config Connector uses the namespace's name as the project ID if neither the resource nor namespace specifies a target project. Read more about how to configure the target project for project-scoped resources.

If you're still seeing the same error, then check if Workload Identity is enabled on your GKE cluster.

To mitigate this issue in the future, ensure that you follow the Config Connector installation instructions.

Version not supported in Config Connector add-on installations

If you can't enable the Config Connector add-on, the following error message appears: Node version 1.15.x-gke.x is unsupported. To resolve this error, verify that the version of the GKE cluster meets the version and feature requirements.

To get all valid versions for your clusters, run the following command:

gcloud container get-server-config --format "yaml(validMasterVersions)" \
    --zone ZONE

Replace ZONE with the Compute Engine zone.

Pick a version from the list that meets the requirements.

The error message also appears if Workload Identity or GKE Monitoring is disabled. Ensure that these features are enabled to fix the error.
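For example, you can inspect both settings on an existing cluster with a command similar to the following:

gcloud container clusters describe CLUSTER_NAME \
    --zone ZONE \
    --format "yaml(workloadIdentityConfig, monitoringConfig)"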

Cannot make changes to immutable fields

Config Connector rejects updates to immutable fields at admission.

For example, updating an immutable field with kubectl apply causes the command to fail immediately.

This means that tools which continuously re-apply resources (for example, GitOps tools) might get stuck while updating a resource if they don't handle admission errors.

Since Config Connector does not allow updates to immutable fields, the only way to perform such an update is to delete and re-create the resource.
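For example, a minimal sketch of such an update, assuming your resource manifest is saved as resource.yaml (note that deleting the resource also deletes the underlying Google Cloud resource unless a deletion policy such as abandon is set):

kubectl delete -f resource.yaml
# Edit the immutable field in resource.yaml, then re-create the resource:
kubectl apply -f resource.yaml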

Error updating the immutable fields when there is no update

You might see the following errors in the status of the Config Connector resource shortly after you create or acquire a Google Cloud resource using Config Connector:

  • Update call failed: error applying desired state: infeasible update: ({true <nil>}) would require recreation (example)

  • Update call failed: cannot make changes to immutable field(s) (example)

This does not necessarily mean that you actually updated the resource. The reason might be that the Google Cloud API changed an immutable field that you manage in the Config Connector resource, which causes a mismatch between the desired state and the live state of the immutable fields.

You can resolve the issue by updating the values of those immutable fields in the Config Connector resource to match the live state. To do so, complete the following steps:

  1. Update the YAML configuration of the Config Connector resource and set the cnrm.cloud.google.com/deletion-policy annotation to abandon (see the sketch after this list).
  2. Apply the updated YAML configuration to update the Config Connector resource's deletion policy.
  3. Abandon the Config Connector resource.
  4. Print out the live state of the corresponding Google Cloud resource using gcloud CLI.
  5. Find the mismatch in between the gcloud CLI output and the YAML configuration of the Config Connector resource, and update those fields in the YAML configuration.
  6. Apply the updated YAML configuration to acquire the abandoned resource.
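For steps 1 and 2, the annotation is set in the resource's metadata, as in the following fragment:

metadata:
  annotations:
    cnrm.cloud.google.com/deletion-policy: abandon

With the abandon deletion policy in place, deleting the Config Connector resource from Kubernetes (step 3) leaves the underlying Google Cloud resource intact so that it can be acquired again in step 6.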

Resource has no status

If your resources don't have a status field, then it is likely that Config Connector is not running properly. Check that Config Connector is running.

No matches for kind "Foo"

This error means that your Kubernetes cluster does not have the CRD for the Foo resource kind installed.

Verify that the kind is a resource kind supported by Config Connector.

If the kind is supported, then that means your Config Connector installation is either out-of-date or invalid.

If you installed Config Connector using the GKE add-on, then your installation should be upgraded automatically. If you manually installed Config Connector, then you must perform a manual upgrade.

Check the GitHub repository to determine which resource kinds are supported by which Config Connector versions (for example, here are the kinds supported by Config Connector v1.44.0).

Labels are not propagated to the Google Cloud resource

Config Connector propagates labels found in metadata.labels to the underlying Google Cloud resource. However, note that not all Google Cloud resources support labels. Check the resource's REST API documentation (for example, here is the API documentation for PubSubTopic) to see whether it supports labels.

Failed calling webhook x509: certificate relies on legacy Common Name field

If you see an error similar to the following example, you might be experiencing an issue with certificates:

Error from server (InternalError): error when creating "/mnt/set-weaver-dns-record.yml": Internal error occurred: failed calling webhook "annotation-defaulter.cnrm.cloud.google.com": Post "https://cnrm-validating-webhook.cnrm-system.svc:443/annotation-defaulter?timeout=30s": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0

To work around this issue, delete the relevant certificates and the Pods:

kubectl delete -n cnrm-system secrets cnrm-webhook-cert-abandon-on-uninstall
kubectl delete -n cnrm-system secrets cnrm-webhook-cert-cnrm-validating-webhook
kubectl delete -n cnrm-system pods -l "cnrm.cloud.google.com/component=cnrm-webhook-manager"

After you have deleted these resources, the correct certificate regenerates.

For more information about this error, see the GitHub issue.

Error due to special characters in resource name

Special characters are not valid in the Kubernetes metadata.name field. If you see an error similar to the following example, then the resource's metadata.name likely has a value with special characters:

a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')

For example, the following SQLUser resource contains an invalid character in metadata.name:

apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLUser
metadata:
  name: test.example@example-project.iam
spec:
  instanceRef:
    name: test-cloudsql-db
  type: "CLOUD_IAM_USER"

If you try to create this resource, you get the following error:

Error from server (Invalid): error when creating "sqlusercrd.yaml": SQLUser.sql.cnrm.cloud.google.com "test.example@example-project.iam" is invalid: metadata.name: Invalid value: "test.example@example-project.iam": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')

If you'd like to give your resource a name that is not a valid Kubernetes name, but is a valid Google Cloud resource name, you can use the resourceID field, as shown in the following example:

apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLUser
metadata:
  name: 'test'
spec:
  instanceRef:
    name: sqlinstance-sample-postgresql
  host: "%"
  type: CLOUD_IAM_USER
  resourceID: test.example@example-project.iam

This configuration causes Config Connector to use resourceID instead of metadata.name as the name of the resource.

Unable to remove fields from resource spec

Removing a field from a Config Connector resource's spec (by updating the resource's .yaml file and re-applying it, or by using kubectl edit to edit the resource spec) does not actually remove that field from either the Config Connector resource's spec or the underlying Google Cloud resource. Instead, removing a field from the spec just makes that field externally managed.

If you want to set a field to an empty or default value in the underlying Google Cloud resource, you must zero out the field in the Config Connector resource's spec:

  • For a list field, set the field to an empty list by using [].

    The following example shows the targetServiceAccounts field that we want to remove:

    spec:
      targetServiceAccounts:
        - external: "foo-bar@foo-project.iam.gserviceaccount.com"
        - external: "bar@foo-project.iam.gserviceaccount.com"
    

    To remove this field, set the value to empty:

    spec:
      targetServiceAccounts: []
    
  • For a primitive field, set the field to an empty value by using one of the following:

    Type      Empty value
    string    ""
    bool      false
    integer   0

    The following example shows the identityNamespace field that we want to remove:

    spec:
      workloadIdentityConfig:
        identityNamespace: "foo-project.svc.id.goog"
    

    To remove this field, set the value to empty:

    spec:
      workloadIdentityConfig:
        identityNamespace: ""
    
  • For object fields, Config Connector currently provides no easy way to set a whole object field to "NULL". You can try setting the object's subfields to empty or default values by following the preceding guidance and verify whether it works.

KNV2005: syncer excessively updating resource

If you are using Config Sync and you are seeing KNV2005 errors for Config Connector resources, then it is likely that Config Sync and Config Connector are fighting over the resource.

Example log message:

KNV2005: detected excessive object updates, approximately 6 times per
minute. This may indicate Config Sync is fighting with another controller over
the object.

Config Sync and Config Connector are said to be "fighting" over a resource if they keep updating the same field(s) to different values. One's update triggers the other to act and update the resource, which causes the other to act and update the resource, and so on.

Fighting is not a problem for most fields. Fields that are specified in Config Sync are not changed by Config Connector, while fields that are not specified in Config Sync and defaulted by Config Connector are ignored by Config Sync. Therefore, for most fields, Config Sync and Config Connector should never end up updating the same field to different values.

There is one exception: list fields. Similar to how Config Connector may default subfields in object fields, Config Connector may also default subfields in objects inside lists. However, since list fields in Config Connector resources are atomic, the defaulting of subfields is considered a change to the value of the entire list.

Therefore, Config Sync and Config Connector will end up fighting if Config Sync sets a list field and Config Connector defaults any subfields within that list.

To work around this issue, you have the following options:

  1. Update the resource manifest in the Config Sync repository to match what Config Connector is trying to set the resource to.

    One way to do this is to temporarily stop syncing configs, wait for Config Connector to finish reconciling the resource, and then update your resource manifest to match the resource on the Kubernetes API Server.

  2. Stop Config Sync from reacting to updates to the resource on the Kubernetes API Server by setting the annotation client.lifecycle.config.k8s.io/mutation to ignore. Read more about how to have Config Sync ignore object mutations.

  3. Stop Config Connector from updating the resource's spec entirely by setting the annotation cnrm.cloud.google.com/state-into-spec to absent on the resource. This annotation is not supported for all resources. To see if your resource supports the annotation, check the corresponding resource reference page. Read more about the annotation.
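For example, the annotation from option 2 is set in the resource's metadata as follows:

metadata:
  annotations:
    client.lifecycle.config.k8s.io/mutation: ignore

Similarly, the annotation from option 3:

metadata:
  annotations:
    cnrm.cloud.google.com/state-into-spec: absent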

failed calling webhook

It's possible for Config Connector to be in a state where you cannot uninstall it. This commonly happens when using the Config Connector add-on and disabling Config Connector before removing the Config Connector CRDs. When you try to uninstall, you receive an error similar to the following:

error during reconciliation: error building deployment objects: error finalizing the deletion of Config Connector system components deployed by ConfigConnector controller: error waiting for CRDs to be deleted: error deleting CRD accesscontextmanageraccesslevels.accesscontextmanager.cnrm.cloud.google.com: Internal error occurred: failed calling webhook "abandon-on-uninstall.cnrm.cloud.google.com": failed to call webhook: Post "https://abandon-on-uninstall.cnrm-system.svc:443/abandon-on-uninstall?timeout=3s": service "abandon-on-uninstall" not found

To resolve this error, you must first manually delete the webhooks:

kubectl delete validatingwebhookconfiguration abandon-on-uninstall.cnrm.cloud.google.com --ignore-not-found --wait=true
kubectl delete validatingwebhookconfiguration validating-webhook.cnrm.cloud.google.com --ignore-not-found --wait=true
kubectl delete mutatingwebhookconfiguration mutating-webhook.cnrm.cloud.google.com --ignore-not-found --wait=true

You can then proceed to uninstall Config Connector.

Update error with IAMPolicy, IAMPartialPolicy and IAMPolicyMember

If you delete an IAMServiceAccount Config Connector resource before cleaning up the IAMPolicy, IAMPartialPolicy, and IAMPolicyMember resources that depend on that service account, Config Connector cannot locate the service account referenced in those IAM resources during reconciliation. This results in an UpdateFailed status with an error message like the following:

Update call failed: error setting policy member: error applying changes: summary: Request `Create IAM Members roles/[MYROLE] serviceAccount:[NAME]@[PROJECT_ID].iam.gserviceaccount.com for project \"projects/[PROJECT_ID]\"` returned error: Error applying IAM policy for project \"projects/[PROJECT_ID]\": Error setting IAM policy for project \"projects/[PROJECT_ID]\": googleapi: Error 400: Service account [NAME]@[PROJECT_ID].iam.gserviceaccount.com does not exist., badRequest

To resolve this issue, check your service accounts and see whether the service account required by those IAM resources has been deleted. If the service account is deleted, clean up the related IAM Config Connector resources, too. For IAMPolicyMember, delete the whole resource. For IAMPolicy and IAMPartialPolicy, only remove the bindings that involve the deleted service account. However, such cleanup doesn't remove Google Cloud role bindings immediately. The Google Cloud role bindings are retained for 60 days because of the retention on the deleted service account. For more information, see the Google Cloud IAM documentation about deleting a service account.
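For example, a minimal sketch of deleting an IAMPolicyMember resource that references the deleted service account (the resource name and namespace are placeholders):

kubectl delete iampolicymember IAM_POLICY_MEMBER_NAME -n NAMESPACE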

To avoid this issue, always clean up IAMPolicy, IAMPartialPolicy, and IAMPolicyMember Config Connector resources before deleting the referenced IAMServiceAccount.

Resource deleted by Config Connector

Config Connector never deletes your resources without an external cause. For example, running kubectl delete, using config management tools like Argo CD, or using a customized API client can cause resource deletion.

A common misconception is that Config Connector initiated the deletion of some of the resources in your cluster. For example, you might notice delete requests from the Config Connector controller manager against certain resources in either container log messages or Kubernetes cluster audit logs. These delete requests are the result of external triggers, and Config Connector is reconciling those delete requests.

To determine why a resource was deleted, you need to look into the first delete request that was sent to the corresponding resource. The best way to look into this is by examining the Kubernetes cluster audit logs.

For example, if you are using GKE, you can use Cloud Logging to query for GKE cluster audit logs. If you want to look for the initial delete requests for a BigQueryDataset resource named foo in namespace bar, you would run a query like the following:

resource.type="k8s_cluster"
resource.labels.project_id="my-project-id"
resource.labels.cluster_name="my-cluster-name"
protoPayload.methodName="com.google.cloud.cnrm.bigquery.v1beta1.bigquerydatasets.delete"
protoPayload.resourceName="bigquery.cnrm.cloud.google.com/v1beta1/namespaces/bar/bigquerydatasets/foo"

Using this query, you would look for the first delete request and then check authenticationInfo.principalEmail of that delete log message to determine the cause of the deletion.

Controller Pod OOMKilled

If you see an OOMKilled error on a Config Connector controller Pod, it indicates that a container or the entire Pod was terminated because it used more memory than allowed. You can verify this by running the kubectl describe command: the Pod's status may appear as OOMKilled or Terminating, and the Pod's events can reveal any OOM-related events.

kubectl describe pod POD_NAME -n cnrm-system

Replace POD_NAME with the name of the Pod that you are troubleshooting.

To address this issue, you can use the ControllerResource custom resource to increase the memory request and the memory limit for the Pod.
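A minimal sketch of such a customization, assuming a cluster-mode installation that supports the ControllerResource customization API (the memory values are illustrative; check the customization reference for the exact schema):

apiVersion: customize.core.cnrm.cloud.google.com/v1beta1
kind: ControllerResource
metadata:
  name: cnrm-controller-manager
spec:
  containers:
  - name: manager
    resources:
      requests:
        memory: 256Mi
      limits:
        memory: 512Mi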

PodSecurityPolicy prevents upgrades

After switching from the Config Connector add-on to a manual install and upgrading Config Connector to a new version, the use of PodSecurityPolicies can prevent cnrm Pods from updating.

To confirm that the PodSecurityPolicies are preventing your upgrade, check the config-connector-operator's events and look for an error similar to the following:

create Pod configconnector-operator-0 in StatefulSet configconnector-operator failed error: pods "configconnector-operator-0" is forbidden: PodSecurityPolicy: unable to admit pod: [pod.metadata.annotations[seccomp.security.alpha.kubernetes.io/pod]: Forbidden: seccomp may not be set pod.metadata.annotations[container.seccomp.security.alpha.kubernetes.io/manager]: Forbidden: seccomp may not be set]

To resolve this issue, you must specify the annotation on the PodSecurityPolicy that corresponds to the annotation mentioned in the error. In the previous example, the annotation is seccomp.security.alpha.kubernetes.io.
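For example, a sketch of allowing seccomp profiles by adding the corresponding annotation to your PodSecurityPolicy's metadata (the allowed profile names are illustrative; adjust them to your security requirements):

metadata:
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'runtime/default,docker/default'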

Forced cleanup

If your Config Connector resources are stuck on deletion and you simply want to remove them from your Kubernetes cluster, you can force their deletion by deleting their finalizers.

You can delete a resource's finalizers by editing the resource using kubectl edit, deleting the metadata.finalizers field, and then saving your changes to the Kubernetes API server.
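Alternatively, a one-line sketch that clears the finalizers with a merge patch, replacing the placeholders with your resource's kind, name, and namespace:

kubectl patch RESOURCE_KIND RESOURCE_NAME -n NAMESPACE \
    --type merge -p '{"metadata":{"finalizers":[]}}'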

Since deleting a resource's finalizers allows the resource to be immediately deleted from the Kubernetes cluster, Config Connector might not get a chance to complete the deletion of the underlying Google Cloud resource. This means that you might want to manually delete your Google Cloud resources afterwards.

Monitoring

Metrics

You can use Prometheus to collect and show metrics from Config Connector.

Logging

All Config Connector Pods output structured logs in JSON format.

The logs of the controller Pods are particularly useful for debugging issues with the reconciliation of resources.

You can query for logs for specific resources by filtering for the following fields in the log messages:

  • logger: contains the resource's kind in lower-case. For example, PubSubTopic resources have a logger of pubsubtopic-controller.
  • resource.namespace: contains the resource's namespace.
  • resource.name: contains the resource's name.
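For example, a minimal sketch that filters the controller logs locally for a particular resource (the grep pattern assumes the structured JSON fields described in the preceding list; adjust it to the exact formatting of your logs):

kubectl logs -n cnrm-system \
    -l cnrm.cloud.google.com/component=cnrm-controller-manager \
    -c manager | grep '"name":"RESOURCE_NAME"'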

Using Cloud Logging for advanced log querying

If you are using GKE, you can use Cloud Logging to query for logs for a specific resource with the following query:

# Filter to include only logs coming from the controller Pods
resource.type="k8s_container"
resource.labels.container_name="manager"
resource.labels.namespace_name="cnrm-system"
labels.k8s-pod/cnrm_cloud_google_com/component="cnrm-controller-manager"

# Filter to include only logs coming from a particular GKE cluster
resource.labels.cluster_name="GKE_CLUSTER_NAME"
resource.labels.location="GKE_CLUSTER_LOCATION"

# Filter to include only logs for a particular Config Connector resource
jsonPayload.logger="RESOURCE_KIND-controller"
jsonPayload.resource.namespace="RESOURCE_NAMESPACE"
jsonPayload.resource.name="RESOURCE_NAME"

Replace the following:

  • GKE_CLUSTER_NAME with the name of the GKE cluster running Config Connector
  • GKE_CLUSTER_LOCATION with the location of the GKE cluster running Config Connector. For example, us-central1.
  • RESOURCE_KIND with the resource's kind in lower-case. For example, pubsubtopic.
  • RESOURCE_NAMESPACE with the resource's namespace.
  • RESOURCE_NAME with the resource's name.

Additional help

To get additional help, you can file an issue on GitHub or contact Google Cloud Support.