How Blibli manages updates for stateful and stateless applications on GKE
At Blibli, an Indonesian business-to-consumer e-commerce provider, we run most of our IT infrastructure on Google Kubernetes Engine (GKE), including both stateful and stateless applications such as Redis, RabbitMQ, Spring Boot, Jenkins, and Grafana.
GKE provides a scalable and reliable managed Kubernetes service that integrates well with other Google Cloud services. And on GKE, we’re now saving more than 30% of infrastructure costs. But like many companies with a lot on their plate and multiple tasks underway, we were once too busy to focus on operations like cluster and node pool updates. Consequently, we fell far behind the current GKE version.
To comply with both service-provider and open-source software policies, you must stay on top of version updates. And since new Kubernetes versions arrive on GKE in roughly three-month cycles, keeping up can be a challenge when running workloads in GKE. But recently, we updated our GKE cluster from version 1.13.x to 1.15.x and tested the same update across different clusters and environments, all without service interruptions.
You can read the release notes and changelogs of the version you plan to upgrade to, so I won’t belabor every detail of our update process. But read on to learn how we keep our GKE clusters up to date with newly released versions and how you can too.
Updating a GKE cluster without downtime
We manage our GKE cluster and everything related to it using Terraform and GitOps, which help to simplify the update process.
With a regional cluster, you can avoid control-plane downtime because GKE maintains replicas of the control plane across the zones of the region, so your cluster is resilient to a single-zone failure. Before you start, double-check that the resources the update needs are available.
Updating a cluster is a two-step process: first the control plane, then the node pools. The node pool step requires a handful of critical network considerations.
The control plane in Kubernetes includes the Kubernetes API server, the scheduler, and the controller manager. The control-plane upgrade is quite simple since GKE manages it for you with a single click (in our case, by changing a variable in Terraform). This update takes several minutes, during which you won’t be able to change the cluster’s configuration, but your workloads will keep running normally.
By default, a cluster's nodes have auto-update enabled, and Google Cloud recommends that you keep it that way. If you’ve opted for auto-update then GKE does the magic for you. You can just sit back and relax.
Unlike the control-plane update, the node pool update has a lot of visibility. Its duration also depends heavily on the total number of nodes in the cluster. For each node in the node pool, in sequence, the node is cordoned so that no new Pods are scheduled on it, its existing Pods are drained, and finally the node is updated.
If, like us, you need to carefully manage dependencies and qualifications, you may elect to manage your own upgrade workflow. Surge upgrades let you control how many nodes GKE can update at a time and how disruptive updates are to your workloads. There are also several options for updating the worker nodes. One obvious way is to manually trigger the update, which parallels the auto-update process except that you decide when it occurs:
gcloud container clusters upgrade cluster-name --node-pool=node-pool-name --cluster-version cluster-version
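The surge upgrade settings mentioned above are configured per node pool. The following is a minimal sketch rather than our exact setup: the cluster, node pool, and region names are hypothetical, and the values (one surge node, zero unavailable nodes) are just one conservative choice. The helper prints the command by default so you can review it before running it.

```shell
set -eu

# Hypothetical names; replace with your own cluster, node pool, and region.
CLUSTER="my-cluster"
NODE_POOL="my-node-pool"
REGION="asia-southeast2"

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Allow one extra (surge) node during an upgrade and never remove a node
# before its replacement is available.
configure_surge() {
  run gcloud container node-pools update "$NODE_POOL" \
    --cluster "$CLUSTER" \
    --region "$REGION" \
    --max-surge-upgrade 1 \
    --max-unavailable-upgrade 0
}

configure_surge
```

Raising the surge value speeds up the upgrade at the cost of temporarily running more nodes, while raising the unavailable value trades capacity for speed.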
Fun fact: You can manually update a node pool version to match the version of the control plane or a previous version that is still available and compatible with the control plane. The Kubernetes version and version skew support policy guarantees that control planes are compatible with nodes up to two minor versions older than the control plane. For example, Kubernetes 1.13.x control planes are compatible with Kubernetes 1.11.x nodes.
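To make the skew rule concrete, here is a small sketch that checks whether a node version sits within two minor versions of the control plane, as described above. The function names are ours for illustration, and the check assumes both versions share the same major version.

```shell
# Extract the minor number from a version such as 1.13.7 or 1.13.7-gke.8.
minor_of() {
  echo "$1" | cut -d. -f2
}

# Succeeds when the node version is no newer than the control plane and at
# most two minor versions behind it. Assumes identical major versions.
skew_ok() {
  cp_minor=$(minor_of "$1")
  node_minor=$(minor_of "$2")
  diff=$((cp_minor - node_minor))
  [ "$diff" -ge 0 ] && [ "$diff" -le 2 ]
}

# Usage: control-plane version first, node version second.
if skew_ok "1.13.7" "1.11.2"; then echo "compatible"; else echo "incompatible"; fi
if skew_ok "1.15.9" "1.12.10"; then echo "compatible"; else echo "incompatible"; fi
```

The first call reports a compatible pair (two minor versions apart) and the second an incompatible one (three apart).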
Although GKE can update large clusters quickly, we expected that with more than 100 nodes and without surge upgrades, draining and upgrading every node could take forever. So our strategy was to rely on the continuous deployment of our applications to move the workloads, while still avoiding downtime.
Here comes the interesting implementation part you were looking for. Rather than updating the current nodes of a node pool, we created a new node pool with the updated version—but with a twist. The new node pool had different taints relative to the node pool with the old version. Can you guess the next step? It’s simple: We deployed our applications matching the taints of the node pool with the new version. Again, you can too.
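The two steps above can be sketched as follows. This is an illustration under assumptions, not our actual pipeline (we drive these changes through Terraform and GitOps): the cluster, pool, and region names, the taint key and value, and the version placeholder are all hypothetical. Commands are printed by default so that nothing touches a cluster until you set DRY_RUN=0.

```shell
set -eu

# Hypothetical names and values throughout; replace with your own.
CLUSTER="my-cluster"
REGION="asia-southeast2"
NEW_POOL="pool-1-15"
NODE_VERSION="1.15.x"                   # placeholder, use a real GKE version
TAINT="pool-version=v115:NoSchedule"    # differs from the old pool's taint

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 1: create the new node pool on the target version with its own taint.
create_pool() {
  run gcloud container node-pools create "$NEW_POOL" \
    --cluster "$CLUSTER" \
    --region "$REGION" \
    --node-version "$NODE_VERSION" \
    --node-taints "$TAINT"
}

# Step 2: a patch that sets the Pod template's tolerations to the new taint.
cat > toleration-patch.yaml <<'EOF'
spec:
  template:
    spec:
      tolerations:
        - key: pool-version
          operator: Equal
          value: v115
          effect: NoSchedule
EOF

# Apply the patch to one Deployment, which starts a rollout onto the new pool.
patch_app() {
  run kubectl patch deployment "$1" --patch "$(cat toleration-patch.yaml)"
}

create_pool
```

Calling patch_app with a Deployment name (for example, patch_app my-app) changes the Pod template, so the Deployment rolls its Pods onto nodes that carry the new taint.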
But (there’s always a “but”) to prevent downtime, you need to ensure that your deployments use a rolling update strategy. And you must confirm a couple of other things before deploying the applications.
Since the node pool you just created has a different taint, the first thing you need to ensure once the nodes in the new node pool are spawned is that the DaemonSets are deployed and running perfectly.
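One way to run this check is to compare the DESIRED and READY counts that kubectl reports for each DaemonSet. The filter below is a sketch with a name we made up, and the sample rows are invented purely to show its behavior.

```shell
# Reads `kubectl get daemonsets --all-namespaces --no-headers` output on stdin
# and prints namespace/name for every DaemonSet whose READY count (column 5)
# differs from its DESIRED count (column 3).
unready_daemonsets() {
  awk '$3 != $5 { print $1 "/" $2 }'
}

# Against a live cluster (not run here):
#   kubectl get daemonsets --all-namespaces --no-headers | unready_daemonsets

# Made-up sample rows, only to demonstrate the filter.
printf '%s\n' \
  'kube-system fluentd-gcp 6 6 6 6 6 <none> 40d' \
  'kube-system ip-masq-agent 6 6 4 6 4 <none> 40d' \
  | unready_daemonsets
```

Also confirm that DESIRED has grown to include the new nodes: a DaemonSet that does not tolerate the new taint is never scheduled there and still reports as fully ready.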
Pod Disruption Budget (PDB)
A PDB is a mechanism that limits how many Pods (as a number or a percentage) may be terminated at the same time during voluntary disruptions such as a node drain. Since the number of replicas and the PDB go hand in hand, we set the PDB for our workloads to maxUnavailable: 1. With two or more replicas, this gives us confidence that at least one Pod is running at any point in the process.
I know StatefulSets are challenging, but GKE offers a variety of ways to manage them. To prevent downtime, we use the following checklist, and you can as well. Additionally, you can take snapshots of the volumes behind your Persistent Volume Claims.
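A PDB with this setting can be sketched as the manifest below. The resource name and the app label are hypothetical, and the apiVersion shown is the current one. Clusters older than 1.21, including the versions discussed here, would use policy/v1beta1 instead.

```shell
# Write a PodDisruptionBudget manifest that allows at most one Pod of the
# selected workload to be unavailable during voluntary disruptions.
cat > pdb.yaml <<'EOF'
apiVersion: policy/v1        # policy/v1beta1 on clusters older than 1.21
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb           # hypothetical name
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app            # hypothetical label, must match your Pods
EOF

echo "wrote pdb.yaml"
```

You can then apply it with kubectl apply -f pdb.yaml. Note that with a single replica, maxUnavailable: 1 still allows that one Pod to be evicted, so pair it with at least two replicas.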
Basically, with this method of updating your node pools, you are doubling the size of your current running cluster during the process. So make sure you have sufficient resources available to meet network requirements, such as primary IPs or NAT IPs in case your nodes talk to the outside world.
The packets in the egress traffic will not be masqueraded and the pod IP will be visible if you:
Are running a version older than 1.14.x, and
Don’t have the ip-masq-agent running, and
Have a destination address range that falls under the CIDRs 10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16
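To make the second condition concrete: the ip-masq-agent reads its settings from a ConfigMap named ip-masq-agent in the kube-system namespace, under the key config. The sketch below writes such a configuration. The CIDR is a hypothetical Pod range, so list only the destinations for which the Pod IP should stay visible. It covers only the ConfigMap and assumes the agent itself is already deployed as a DaemonSet.

```shell
# Write the agent configuration. Traffic to the listed CIDRs keeps the Pod IP;
# traffic to every other destination is masqueraded to the node IP.
cat > ip-masq-config.yaml <<'EOF'
nonMasqueradeCIDRs:
  - 10.8.0.0/14      # hypothetical Pod range, replace with your own
resyncInterval: 60s
EOF

echo "wrote ip-masq-config.yaml"
```

To load it into the cluster, you could run kubectl create configmap ip-masq-agent --namespace kube-system --from-file=config=ip-masq-config.yaml.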
One way to address this is to add the ip-masq-agent and list the destination CIDRs for the nonMasqueradeCIDRs configuration.
GKE is a versatile platform that provides great value to manage microservices and run them efficiently. For a fully managed, highly effective but low maintenance Kubernetes solution, consider using GKE. Then you can focus on exploring rather than managing the underlying system.