Cluster randomly enters the "entering repair" state

Problem

Your Google Kubernetes Engine cluster might intermittently move to the "entering repair" state for no apparent reason.

Environment

  • Google Kubernetes Engine

Solution

  1. In the Primary API Server logs, you may see the following errors:
    "Failed calling webhook, failing open validation.gatekeeper.sh: failed calling webhook "validation.gatekeeper.sh": Post "https://gatekeeper-webhook-service.gatekeeper-system.svc:443/v1/admit?timeout=3s": context deadline exceeded"
    
    "Failed calling webhook, failing open mutation.gatekeeper.sh: failed calling webhook "mutation.gatekeeper.sh": Post "https://gatekeeper-webhook-service.gatekeeper-system.svc:443/v1/mutate?timeout=3s": context deadline exceeded"
    
  2. In your project, open the object viewer configuration, locate the webhooks validation.gatekeeper.sh and mutation.gatekeeper.sh, and check the scope defined in their rules. If the scope is just *, the webhook applies to all resources.

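Narrow the scope to the resources that you actually need the webhooks to check, rather than all resources. A scope of * is too broad: the webhooks cannot respond within the configured timeout (3s in the errors above), so the API server calls fail with "context deadline exceeded".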
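
The check in step 2 can also be sketched as a small script. This is a minimal illustration, not an official tool. It assumes you have exported a webhook configuration as JSON (for example with kubectl get ... -o json) and that it follows the standard Kubernetes layout of webhooks[].rules[] with apiGroups and resources fields. It also assumes that "scope" here means the resources the rules match. The function name wildcard_webhooks, the SAMPLE data, and the webhook pods.example.internal are hypothetical and used only for illustration.

```python
def wildcard_webhooks(config: dict) -> list[str]:
    """Return the names of webhooks that have a rule matching every resource.

    A rule is flagged when both its apiGroups and resources lists contain "*".
    """
    flagged = []
    for webhook in config.get("webhooks", []):
        for rule in webhook.get("rules", []):
            matches_all_groups = "*" in rule.get("apiGroups", [])
            matches_all_resources = "*" in rule.get("resources", [])
            if matches_all_groups and matches_all_resources:
                flagged.append(webhook.get("name", "<unnamed>"))
                break  # one wildcard rule is enough to flag this webhook
    return flagged


# Hypothetical sample: one wildcard webhook and one narrowed webhook.
SAMPLE = {
    "webhooks": [
        {
            "name": "validation.gatekeeper.sh",
            "rules": [
                {
                    "apiGroups": ["*"],
                    "apiVersions": ["*"],
                    "operations": ["CREATE", "UPDATE"],
                    "resources": ["*"],
                }
            ],
        },
        {
            "name": "pods.example.internal",  # hypothetical narrowed webhook
            "rules": [
                {
                    "apiGroups": [""],
                    "apiVersions": ["v1"],
                    "operations": ["CREATE"],
                    "resources": ["pods"],
                }
            ],
        },
    ]
}

if __name__ == "__main__":
    for name in wildcard_webhooks(SAMPLE):
        print(f"{name}: rules match all resources; consider narrowing them")
```

Any webhook the script reports is a candidate for narrowing to the specific API groups and resources that your policies cover.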