Troubleshoot restore errors in Backup for GKE

This document describes the errors and corresponding codes you might encounter when using Backup for GKE to perform restore operations. Each section explains what to consider before you act on a restore error and gives instructions for resolving it.

Error 200010301: Failure to complete restore operation due to unavailable admission webhook service

Error 200010301 occurs when a restore operation fails because an admission webhook service, also referred to as an HTTP callback, is unavailable. The resulting error message indicates that the GKE API server attempted to contact an admission webhook while restoring a resource, but the service backing the webhook was either unavailable or not found:

  resource [/example-group/ClusterSecretStore/example-store] restore failed:

  Internal error occurred: failed calling webhook "example-webhook.io":
  failed to call webhook: Post "https://example-webhook.example-namespace.svc:443/validate-example": service "example-webhook" not found.

This error occurs when a ValidatingAdmissionWebhook or MutatingAdmissionWebhook GKE resource is active in the target cluster, but the GKE API server can't reach the endpoint configured in the webhook. Admission webhooks intercept requests to the GKE API server, and their configuration specifies how the GKE API server should query the webhooks.

The webhook's clientConfig specifies the backend that handles the admission requests, which can be an internal cluster service or an external URL. The choice between these two options depends on the specific operational and architectural requirements of your webhook. Depending on the option type, the restore operation might have failed for the following reasons:

  • In-cluster services: the GKE service and its backing pods weren't restored or ready when the GKE API server attempted to call the webhook. This occurs during restore operations where cluster-scoped webhook configurations are applied before the namespaced services are in a fully ready state.

  • External URLs: the external endpoint is temporarily unavailable due to network connectivity issues between the GKE cluster and the external endpoint, or due to DNS resolution issues or firewall rules.

To resolve this error, use the following instructions:

  1. Identify the failing webhook mentioned in the error message. For example, failed calling webhook "...".

  2. Inspect the webhook by running the kubectl get validatingwebhookconfigurations command:

    kubectl get validatingwebhookconfigurations WEBHOOK_NAME -o yaml
    

    Replace WEBHOOK_NAME with the name of the webhook that was identified in the error message.

    You can also use the kubectl get mutatingwebhookconfigurations command to inspect the webhook:

    kubectl get mutatingwebhookconfigurations WEBHOOK_NAME -o yaml
    

    Replace WEBHOOK_NAME with the name of the webhook that was identified in the error message.

  3. Perform the following troubleshooting steps based on your configuration type:

    Service-based clientConfig

    Define a custom restore order by modifying the RestorePlan resource to include a RestoreOrder with GroupKindDependency entries. This lets the components backing the webhook, such as the Deployment, StatefulSet, or Service, be restored and ready before the ValidatingWebhookConfiguration or MutatingWebhookConfiguration.

    For instructions on how to define a custom restore order, see Specify resource restore ordering during restoration.

    This approach can still fail if the service's pods don't reach a fully ready state even after the Service object is created, or if another application creates the webhook configuration unexpectedly. As an alternative, you can perform a two-stage restore operation using the following steps:

    1. Create a Restore resource from the backup and configure it with a fine-grained restore filter that includes only the resources the webhook needs to function, for example, Namespaces, Deployments, StatefulSets, or Services.

      For more information on how to configure the restore with a fine-grained restore filter, see Enable fine-grained restore.

    2. Create another Restore resource from the same backup and configure it to restore the remaining resources that you choose.

    URL-based clientConfig

    1. Verify the external HTTPS endpoint and make sure it's active, reachable, and functioning correctly.

    2. Confirm that there is network connectivity from your GKE cluster's nodes and control plane to the external URL. You might also need to check network policies, DNS resolution, and firewall rules, for example, if the webhook is hosted in a Virtual Private Cloud network, on-premises, or with another cloud provider.

  4. Retry the restore operation. If the operation continues to fail, contact Cloud Customer Care for further assistance.
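
The first steps above can be sketched in shell. This is a minimal sketch, not an official procedure: the function names are hypothetical, and the parsing assumes the error message has the same layout as the example shown earlier, including an in-cluster URL of the form https://SERVICE.NAMESPACE.svc:PORT/PATH. The name in the error message is the name of a webhook entry, which can differ from the name of the configuration object that contains it, so list_webhooks prints both to help you find the WEBHOOK_NAME to inspect.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of any tool.

# Print the webhook name quoted after 'failed calling webhook'.
extract_webhook_name() {
  printf '%s\n' "$1" | sed -n 's/.*failed calling webhook "\([^"]*\)".*/\1/p'
}

# Print "SERVICE NAMESPACE" taken from an in-cluster webhook URL.
# Assumes the form https://SERVICE.NAMESPACE.svc:PORT/PATH; prints nothing
# for an external URL.
extract_service() {
  printf '%s\n' "$1" | sed -n 's|.*https://\([^./]*\)\.\([^./]*\)\.svc[:/].*|\1 \2|p'
}

# List every webhook configuration together with the webhook names it holds.
list_webhooks() {
  kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations \
    -o custom-columns='KIND:.kind,CONFIG:.metadata.name,WEBHOOKS:.webhooks[*].name'
}

# Check whether the Service behind the webhook exists and has endpoints.
# $1 is the Service name, $2 is the namespace.
check_service() {
  kubectl get service "$1" -n "$2"
  kubectl get endpoints "$1" -n "$2"
}

# Example usage (requires access to the target cluster):
#   extract_webhook_name "$ERROR_MESSAGE"
#   list_webhooks
#   check_service $(extract_service "$ERROR_MESSAGE")
```

If check_service reports that the Service is not found or shows no endpoint addresses, the backing components are likely not restored or not ready yet, which matches the in-cluster case described earlier.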

Error 200010302: Failure to complete restore operation due to denied resource creation request

Error 200010302 occurs when a restore operation fails because an admission webhook denies a resource creation request. The resulting error message indicates that a resource from your backup couldn't be created in the target cluster because an active admission webhook intercepted the request and rejected it based on a custom policy:

  [KubeError]; e.g. resource [/example-namespace/example-api/ExampleResource/example-name]
  restore failed: admission webhook "example-webhook.example.com" denied the request: {reason for denial}

This error is caused by the configuration of the target GKE cluster, which has either a ValidatingAdmissionWebhook or MutatingAdmissionWebhook that enforces specific rules on resource creation and modification and blocks the resource creation request. For example, a webhook might prevent the creation of a resource because a related but conflicting resource already exists in the cluster, such as denying the creation of a Deployment that is already managed by a HorizontalPodAutoscaler GKE API resource.

To resolve this error, use the following instructions:

  1. Identify the webhook that is denying the request using the error message that occurs when the restore operation fails. For example, webhook WEBHOOK_NAME denied the request. The error message contains the following information:

    • Webhook name: the name of the webhook denying the request.

    • Reason for denial: the specific reason for denying the request.

  2. Inspect the webhook using the kubectl get validatingwebhookconfigurations command:

    kubectl get validatingwebhookconfigurations WEBHOOK_NAME -o yaml
    

    Replace WEBHOOK_NAME with the name of the webhook you identified in the error message.

    You can also use the kubectl get mutatingwebhookconfigurations command to inspect the webhook:

    kubectl get mutatingwebhookconfigurations WEBHOOK_NAME -o yaml
    

    Replace WEBHOOK_NAME with the name of the webhook you identified from the error message.

  3. Resolve the underlying issue in the target cluster. The correct action depends on the specific error. For example, if there is a HorizontalPodAutoscaler conflict, you need to delete the existing HorizontalPodAutoscaler in the target cluster before running the restore so that the backed-up workload and its associated resources can be created.

  4. Retry the restore operation. If the restore operation continues to fail, contact Cloud Customer Care for further assistance.
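
As an illustration of the steps above, the following shell sketch pulls the webhook name and the reason for denial out of an error message and, for the HorizontalPodAutoscaler example, lists which workload each autoscaler in a namespace targets. The function names are hypothetical, and the parsing assumes the message follows the example layout shown earlier, with the reason on the same line as "denied the request:".

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of any tool.

# Print the name of the webhook that denied the request.
denying_webhook() {
  printf '%s\n' "$1" | sed -n 's/.*admission webhook "\([^"]*\)" denied the request.*/\1/p'
}

# Print the reason text that follows "denied the request: ".
denial_reason() {
  printf '%s\n' "$1" | sed -n 's/.*denied the request: \(.*\)$/\1/p'
}

# List the HorizontalPodAutoscalers in namespace $1 with the workload
# that each one scales.
list_hpa_targets() {
  kubectl get horizontalpodautoscalers -n "$1" \
    -o custom-columns='HPA:.metadata.name,TARGET_KIND:.spec.scaleTargetRef.kind,TARGET_NAME:.spec.scaleTargetRef.name'
}

# Example usage (requires access to the target cluster):
#   denying_webhook "$ERROR_MESSAGE"
#   denial_reason "$ERROR_MESSAGE"
#   list_hpa_targets example-namespace
```

Whether an existing HorizontalPodAutoscaler is actually the cause depends on the policy that the webhook enforces, so treat the reason for denial as the primary guide.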

Error 200060202: Failure to complete restore operation due to missing GKE resource during workload validation

Error 200060202 occurs during the workload validation phase of a restore operation when a GKE resource that Backup for GKE expects to validate cannot be found in the target cluster, resulting in the following error message:

  Workload Validation Error: [KIND] "[NAME]" not found

For example: Workload Validation Error: pods "jenkins-0" not found

This error occurs when Backup for GKE successfully creates or updates the GKE resource as part of the restore operation, but one or more of those resources is no longer present in the target cluster when the validation stage begins. The resource was deleted after the restore process created or updated it, but before workload validation for the resource could complete. An error like this can occur for the following reasons:

  • Manual deletion: a user or administrator manually deleted the resource using kubectl or Google Cloud tools.

  • External automation: GitOps controllers such as Config Sync, ArgoCD, Flux, custom scripts, or other cluster management tools reverted or deleted the resource to match a desired state in a repository.

  • GKE controllers: a GKE controller deleted the resource because it conflicted with other resources or policies, or because an OwnerReference chain led to garbage collection, which is the automated GKE cleanup process that deletes dependent resources when their owner resource is deleted.

To resolve this error, use the following instructions:

  1. Identify the missing resource using the error message that appears when the restore operation fails to complete.

  2. Locate the namespace the resource belongs to using one of the following methods:

    • GKE audit logs: examine the GKE audit logs that were generated when you attempted the restore operation. You can filter logs for delete operations on the resource Kind and Name. The audit log entry contains the original namespace.

    • Backup details: review the scope of your restore operation and the contents of the backup. The backup index shows the original namespace of the resource. You can also verify whether the RestorePlan contains a TransformationRule that specifies rules to restore the resource in a namespace that you choose.

    • Search across namespaces: use the kubectl get command to search for the resource across all namespaces:

      kubectl get KIND --all-namespaces | grep NAME
      

      Replace KIND and NAME with the values from the error message. If the resource still exists, this command will show its namespace.

  3. Verify deletion using the kubectl get command:

    kubectl get KIND NAME -n NAMESPACE
    

    Replace KIND and NAME with the values from the error message, and NAMESPACE with the namespace that you located in the previous step. You should receive a not found error message.

  4. Investigate the cause of deletion using one of the following methods:

    • GKE audit logs: identify which entity issued the deletion request. For example, the user, service account, or controller.

    • Review configured automations: if you use GitOps or other automation tools, check their logs and status to see if they interfered with the restored resources.

    • Examine related events: check GKE events in the namespace that you identified using the kubectl get events command:

      kubectl get events -n NAMESPACE --sort-by='.lastTimestamp'
      

      Replace NAMESPACE with the name of the namespace.

  5. Address the cause of the resource deletion based on the results of the previous step. For example, pause conflicting automations, correct misconfigurations, or adjust user permissions.

  6. Recover the missing resource using one of the following methods:

    • Re-apply manifest files: if you have the manifest for the missing resource, you can re-apply it to the correct namespace.

    • Perform a fine-grained restore: perform a fine-grained restore operation to selectively restore just the missing resource from the same backup, making sure that you specify the correct namespace. For more information about how to perform a fine-grained restore operation, see Enable fine-grained restore.

  7. Retry the restore operation. If the restore operation continues to fail, contact Cloud Customer Care for further assistance.
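
The lookup and investigation steps above can be sketched in shell as follows. The function names are hypothetical, and the parsing assumes the message has the form shown in the earlier example. Two choices here differ from the commands in the steps and are assumptions of this sketch: the search uses a field selector on metadata.name for an exact match instead of piping through grep, and the audit log filter assumes the usual GKE audit log fields (resource type k8s_cluster, with the method and resource names in protoPayload), so adjust it if your log entries look different.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of any tool.

# Print "KIND NAME" parsed from a workload validation error message.
parse_validation_error() {
  printf '%s\n' "$1" | sed -n 's/.*Workload Validation Error: \([^ ]*\) "\([^"]*\)" not found.*/\1 \2/p'
}

# Search all namespaces for a resource of kind $1 whose name is exactly $2.
find_resource() {
  kubectl get "$1" --all-namespaces --field-selector "metadata.name=$2"
}

# Show events in namespace $1 that involve the object named $2,
# sorted by last timestamp.
resource_events() {
  kubectl get events -n "$1" --field-selector "involvedObject.name=$2" \
    --sort-by='.lastTimestamp'
}

# Print a Cloud Logging filter for delete operations on KIND $1, NAME $2.
# Assumption: audit entries use resource.type "k8s_cluster" and a
# resourceName that contains "KIND/NAME".
audit_filter() {
  printf 'resource.type="k8s_cluster" AND protoPayload.methodName:"delete" AND protoPayload.resourceName:"%s/%s"\n' "$1" "$2"
}

# Example usage (requires cluster access and the gcloud CLI):
#   set -- $(parse_validation_error "$ERROR_MESSAGE")   # $1=KIND, $2=NAME
#   find_resource "$1" "$2"
#   gcloud logging read "$(audit_filter "$1" "$2")" --limit=10
```

In a matching audit log entry, the protoPayload.authenticationInfo.principalEmail field typically identifies the user or service account that issued the deletion request.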
