Error causes cluster primaries to enter a repair loop

Problem

You notice that no kubectl commands to create or delete pods can be run against the cluster, because the primaries are down due to a repair loop. The following errors are reported:

  all cluster resources were brought up, but: component "etcd-0" from endpoint "gke-<primary hash>-<wxyz>" is unhealthy
  all cluster resources were brought up, but: component "kube-apiserver" from endpoint "gke-<primary hash>-<wxyz>" is unhealthy
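
To confirm the repair activity from outside the cluster, the control plane status can be checked with the gcloud CLI; the cluster name and region below are placeholders:

  gcloud container clusters describe <cluster-name> --region <region> --format='value(status)'
  gcloud container operations list --region <region>

Repair operations on the primaries typically appear in the operations list with type REPAIR_CLUSTER while the control plane is being recreated.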

Environment

  • Google Kubernetes Engine
  • Regional cluster with 3 Primaries

Solution

  1. Manually increase the etcd database quota on each primary instance by raising --quota-backend-bytes from 6.5 GB to the maximum value of 8 GB.
  2. Once the kube-apiserver is healthy again, use kubectl to delete the unnecessary or erroneous Kubernetes objects that caused the database to fill up (see the sketch after this list).
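
A minimal sketch of both steps, assuming direct access to the etcd configuration on the primaries (on GKE the control plane is Google-managed, so the quota change is normally applied by Google rather than by the cluster owner); the namespace name is illustrative:

  # Step 1: raise the etcd backend quota to the 8 GB maximum (8589934592 bytes).
  # The flag is set on the etcd process on each primary.
  --quota-backend-bytes=8589934592
  # If etcd already raised a NOSPACE alarm, it may also need to be disarmed:
  #   ETCDCTL_API=3 etcdctl alarm disarm

  # Step 2: once kube-apiserver responds again, remove the leftover objects.
  # <pipeline-namespace> stands for the namespace holding the old Jobs.
  kubectl delete jobs --all -n <pipeline-namespace>
  # Deleting the Jobs also garbage-collects their Pods via owner references.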

Cause

These errors can be caused by numerous batch Job resource objects created by an automation pipeline and never cleaned up.

In this case, hundreds of old Jobs (and their Pods) with large payloads in their arguments were left undeleted, sitting in the database and likely causing the etcd database to exceed its quota. Most of these objects were in a single namespace.

Any attempt to delete the objects and free up space failed because the kube-apiserver on the primaries was unhealthy.
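
Once the API server is healthy again, a quick way to see where the buildup happened is to count objects per namespace; this is an illustrative example, not a command from the original incident:

  # Count Jobs per namespace, largest first, to find the offending namespace.
  kubectl get jobs --all-namespaces --no-headers | awk '{print $1}' | sort | uniq -c | sort -rn

  # The same pattern works for the Pods left behind by those Jobs.
  kubectl get pods --all-namespaces --no-headers | awk '{print $1}' | sort | uniq -c | sort -rn

Setting spec.ttlSecondsAfterFinished on the pipeline's Jobs is one way to keep finished Jobs from accumulating again.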