This document describes the errors, and their corresponding codes, that you might encounter when using Backup for GKE to perform restore operations. Each section explains what causes the error, what to consider while troubleshooting, and how to resolve the restore operation error.
Error 200010301: Failure to complete restore operation due to unavailable admission webhook service
Error 200010301 occurs when a restore operation fails because an admission webhook service, also referred to as an HTTP callback, is unavailable. The resulting error message indicates that the GKE API server attempted to contact an admission webhook while restoring a resource, but the service backing the webhook was either unavailable or not found:
resource [/example-group/ClusterSecretStore/example-store] restore failed:
Internal error occurred: failed calling webhook "example-webhook.io":
failed to call webhook: Post "https://example-webhook.example-namespace.svc:443/validate-example": service "example-webhook" not found.
This error occurs when a ValidatingAdmissionWebhook or MutatingAdmissionWebhook resource is active in the target cluster, but the GKE API server can't reach the endpoint configured in the webhook. Admission webhooks intercept requests to the GKE API server, and their configuration specifies how the GKE API server should call them. The webhook's clientConfig specifies the backend that handles the admission requests, which can be either an in-cluster service or an external URL. The choice between these two options depends on the specific operational and architectural requirements of your webhook. Depending on the configuration type, the restore operation might have failed for the following reasons:
In-cluster services: the GKE Service and its backing Pods aren't restored or ready when the GKE API server attempts to call the webhook. This occurs during restore operations where cluster-scoped webhook configurations are applied before the namespaced services are fully in a ready state. You can diagnose this case with the commands that follow this list.
External URLs: the external endpoint is temporarily unavailable because of network connectivity issues between the GKE cluster and the external endpoint, DNS resolution issues, or firewall rules.
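For the in-cluster case, you can check after the failed restore whether the webhook's backing Service exists and has ready endpoints. This is a minimal check; the service name and namespace below are taken from the example error message, so substitute your own:
# Service and namespace names come from the example error message; replace them with yours.
kubectl get service example-webhook -n example-namespace
kubectl get endpoints example-webhook -n example-namespace
If the Service is missing or the ENDPOINTS column is empty, the GKE API server can't deliver admission requests to the webhook.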
To resolve this error, use the following instructions:
Identify the failing webhook mentioned in the error message, for example, failed calling webhook "...".
Inspect the webhook by running the kubectl get validatingwebhookconfigurations command:
kubectl get validatingwebhookconfigurations WEBHOOK_NAME -o yaml
Replace WEBHOOK_NAME with the name of the webhook that was identified in the error message. You can also use the kubectl get mutatingwebhookconfigurations command to inspect the webhook:
kubectl get mutatingwebhookconfigurations WEBHOOK_NAME -o yaml
Replace WEBHOOK_NAME with the name of the webhook that was identified in the error message.
Perform the following troubleshooting steps based on your configuration type:
Service-based clientConfig
Define a custom restore order by modifying the RestorePlan resource to include a RestoreOrder with GroupKindDependency entries. This allows the components backing the webhook, such as the Deployment, StatefulSet, or Service, to be restored and ready before the ValidatingWebhookConfiguration or MutatingWebhookConfiguration. For instructions on how to define a custom restore order, see Specify resource restore ordering during restoration.
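To confirm which in-cluster components back the webhook, and therefore which GroupKind entries your dependency list needs, you can read the service reference from the webhook configuration. This is a sketch that assumes a ValidatingWebhookConfiguration; use mutatingwebhookconfigurations for the mutating variant:
# Prints the namespace and name of the Service that each webhook entry calls.
kubectl get validatingwebhookconfigurations WEBHOOK_NAME -o jsonpath='{.webhooks[*].clientConfig.service}'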
The custom restore order approach can still fail if the service's Pods don't reach a fully ready state even after the Service object is created, or if the webhook configuration is created unexpectedly by another application. Alternatively, you can perform a two-stage restore operation using the following steps:
Create a Restore resource that uses the backup and configure it with a fine-grained restore filter that includes only the specific resources required for the webhook to function, for example, Namespaces, Deployments, StatefulSets, or Services. For more information about how to configure the restore with a fine-grained restore filter, see Enable fine-grained restore.
Create another Restore resource for the same backup and configure it with the rest of the resources you want to restore.
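Between the two stages, you can confirm that the webhook's backend is serving before you start the second restore. This is a sketch; the Deployment name and namespace are hypothetical and should match the workload that backs your webhook:
# Hypothetical Deployment and namespace; substitute the workload that backs your webhook.
kubectl wait --for=condition=Available deployment/example-webhook -n example-namespace --timeout=300s
kubectl get endpoints example-webhook -n example-namespace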
URL-based clientConfig
Verify that the external HTTPS endpoint is active, reachable, and functioning correctly.
Confirm that there is network connectivity from your GKE cluster's nodes and control plane to the external URL. Depending on where the webhook is hosted, for example, in a Virtual Private Cloud network, on-premises, or with another cloud provider, you might also need to check firewall rules, network policies, and DNS resolution.
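To spot-check reachability from inside the cluster, you can run a temporary client Pod against the webhook URL. This sketch only exercises the path from the cluster's nodes, not from the control plane; the image and URL are placeholders, and the -k flag skips certificate verification so the check focuses on connectivity:
# Placeholder image and URL; replace the URL with the endpoint from your webhook's clientConfig.
kubectl run webhook-probe --rm -it --restart=Never --image=curlimages/curl -- curl -vk https://example-webhook.example.com/validate-example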
Retry the restore operation. If the operation continues to fail, contact Cloud Customer Care for further assistance.
Error 200010302: Failure to complete restore operation due to denied resource creation request
Error 200010302 occurs when a restore operation fails because an admission webhook denies a resource creation request. The resulting error message indicates that a resource from your backup couldn't be created in the target cluster because an active admission webhook intercepted the request and rejected it based on a custom policy:
[KubeError]; e.g. resource
[/example-namespace/example-api/ExampleResource/example-name]
restore failed: admission webhook "example-webhook.example.com" denied the request: {reason for denial}
This error is caused by the configuration of the target GKE cluster, which has either a ValidatingAdmissionWebhook or MutatingAdmissionWebhook that enforces specific rules on resource creation and modification and blocks the resource creation request. For example, a webhook might prevent the creation of a resource because a related but conflicting resource already exists in the cluster, such as denying the creation of a Deployment that is already managed by an existing HorizontalPodAutoscaler GKE API resource.
To resolve this error, use the following instructions:
Identify the webhook that is denying the request from the error message that appears when the restore operation fails, for example, webhook WEBHOOK_NAME denied the request. The error message contains the following information:
Webhook name: the name of the webhook denying the request.
Reason for denial: the specific reason for denying the request.
Inspect the webhook using the kubectl get validatingwebhookconfigurations command:
kubectl get validatingwebhookconfigurations WEBHOOK_NAME -o yaml
Replace WEBHOOK_NAME with the name of the webhook you identified in the error message. You can also use the kubectl get mutatingwebhookconfigurations command to inspect the webhook:
kubectl get mutatingwebhookconfigurations WEBHOOK_NAME -o yaml
Replace WEBHOOK_NAME with the name of the webhook you identified in the error message.
Resolve the underlying issue in the target cluster. The correct action depends on the specific error. For example, if there is a HorizontalPodAutoscaler conflict, delete the existing HorizontalPodAutoscaler in the target cluster before running the restore so that the backed-up workloads and their associated resources can be created, as shown in the commands that follow.
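A minimal sketch for the HorizontalPodAutoscaler example: list the HorizontalPodAutoscaler objects in the affected namespace, then delete the conflicting one. The namespace and object name are placeholders:
# Placeholder namespace and name; use the values reported in the webhook's denial reason.
kubectl get hpa -n example-namespace
kubectl delete hpa example-hpa -n example-namespace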
Retry the restore operation. If the restore operation continues to fail, contact Cloud Customer Care for further assistance.
Error 200060202: Failure to complete restore operation due to missing GKE resource during workload validation
Error 200060202 occurs during the workload validation phase of a restore operation when a GKE resource that Backup for GKE expects to validate can't be found in the target cluster, resulting in the following error message:
Workload Validation Error: [KIND] "[NAME]" not found
For example: Workload Validation Error: pods "jenkins-0" not found
This error occurs when Backup for GKE successfully creates or updates a GKE resource as part of the restore operation, but the resource is no longer present in the target cluster when the validation stage begins. That is, the resource was deleted after the restore process initially created or updated it, but before workload validation for the resource could complete. This can occur for the following reasons:
Manual deletion: a user or administrator manually deleted the resource using kubectl or other Google Cloud tools.
External automation: GitOps controllers such as Config Sync, ArgoCD, Flux, custom scripts, or other cluster management tools reverted or deleted the resource to match a desired state in a repository.
GKE controllers: a GKE controller deleted the resource because it conflicted with other resources or policies, or because an OwnerReference chain triggered garbage collection, the automated cleanup process that deletes dependent resources when their owner resource is deleted.
To resolve this error, use the following instructions:
Identify the missing resource using the error message that appears when the restore operation fails to complete.
Locate the namespace that the resource belongs to using one of the following methods:
GKE audit logs: examine the GKE audit logs that were generated when you attempted the restore operation. You can filter the logs for delete operations on the resource Kind and Name; the audit log entry contains the original namespace. See the example query after this list.
Backup details: review the scope of your restore operation and the contents of the backup. The backup index shows the original namespace of the resource. You can also verify whether the RestorePlan contains a TransformationRule that specifies rules to restore the resource into a namespace that you choose.
Search across namespaces: use the kubectl get command to search for the resource across all namespaces:
kubectl get KIND --all-namespaces | grep NAME
Replace KIND and NAME with the values from the error message. If the resource still exists, this command shows its namespace.
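For the audit log method, a query similar to the following sketch can surface the delete call and the namespace it targeted. It assumes the Pod from the example error message; CLUSTER_NAME is a placeholder, and the methodName value shown matches Pod deletions, so adjust it for other resource kinds:
# The filter assumes a deleted Pod named jenkins-0; replace the cluster name, method, and resource name as needed.
gcloud logging read 'resource.type="k8s_cluster" AND resource.labels.cluster_name="CLUSTER_NAME" AND protoPayload.methodName="io.k8s.core.v1.pods.delete" AND protoPayload.resourceName:"jenkins-0"' --limit=10 --format=json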
Verify the deletion using the kubectl get command:
kubectl get KIND NAME -n NAMESPACE
Replace KIND and NAME with the values from the error message, and NAMESPACE with the namespace you located. You should receive a not found error message.
Investigate the cause of the deletion using one of the following methods:
GKE audit logs: identify which entity issued the deletion request, for example, a user, service account, or controller.
Review configured automations: if you use GitOps or other automation tools, check their logs and status to see whether they interfered with the restored resources.
Examine related events: check GKE events in the namespace you determined by using the kubectl get events command:
kubectl get events -n NAMESPACE --sort-by='.lastTimestamp'
Replace NAMESPACE with the name of the namespace.
Address the cause of the resource deletion based on the results of the previous step. For example, pause conflicting automations, correct misconfigurations, or adjust user permissions.
Recover the missing resource using one of the following methods:
Re-apply manifest files: if you have the manifest for the missing resource, you can re-apply it to the correct namespace, as shown in the example after this list.
Perform a fine-grained restore: perform a fine-grained restore operation to selectively restore just the missing resource from the same backup, which ensures you specify the correct namespace. For more information about how to perform a fine-grained restore operation, see Enable fine-grained restore.
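For the re-apply option, a minimal sketch, assuming the manifest for the missing resource is saved locally under a hypothetical file name:
# Hypothetical file name; point this at the manifest of the missing resource.
kubectl apply -f missing-resource.yaml -n NAMESPACE
Replace NAMESPACE with the namespace you located earlier.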
Retry the restore operation. If the restore operation continues to fail, contact Cloud Customer Care for further assistance.