Scaling workloads across multiple dimensions in GKE
In Google Kubernetes Engine (GKE), application owners can define multiple autoscaling behaviors for a workload using a single Kubernetes resource: Multidimensional Pod Autoscaler (MPA).
The challenges of scaling Pods horizontally and vertically
The success of Kubernetes as a widely adopted platform is grounded in its support for a variety of workloads and their many requirements. One of the areas that has continuously improved over time is workload autoscaling.
Dating back to the early days of Kubernetes, Horizontal Pod Autoscaler (HPA) was the primary mechanism for autoscaling Pods. As its name suggests, it adds Pod replicas when a given metric crosses a user-defined threshold. Early on this was typically CPU or memory usage, though custom and external metrics are now supported as well.
A bit further down the line, Vertical Pod Autoscaler (VPA) added a new dimension to workload autoscaling. True to its name, VPA recommends how much CPU or memory Pods should request based on usage patterns. Users can then either review those recommendations and decide whether to apply them, or entrust VPA to apply the changes automatically on their behalf.
Naturally, Kubernetes users have sought to get the benefits from both of these forms of scaling.
While these autoscalers work well independently of one another, running both at the same time can produce unexpected results.
Picture an example where HPA adjusts the number of replicas for a Pod to maintain a target 50% CPU utilization. VPA, when configured to automatically apply recommendations, could fall into a loop of continuously shrinking CPU requests – a direct result of HPA maintaining its relatively low target for CPU utilization!
Part of the challenge here is that when configured to act autonomously, VPA applies changes for both CPU and memory. Thus, the contention can be difficult to avoid as long as VPA is automatically applying changes.
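The loop described above can be made concrete with a toy model. The sketch below is not how HPA or VPA are implemented. It assumes a fixed total CPU load, an HPA that resizes to hit its target exactly, and a VPA that sets each request to the observed per-Pod usage plus 15% headroom. All numbers and the function name `simulate_hpa_vpa` are illustrative assumptions.

```shell
# Toy model of HPA and VPA both reacting to CPU. Not the real controllers:
# every number and name here is an illustrative assumption.
simulate_hpa_vpa() {
  local rounds=$1
  local load=1000     # total CPU demand across all replicas (millicores)
  local request=200   # starting per-Pod CPU request (millicores)
  local target=50     # HPA target CPU utilization (percent of request)
  local replicas usage i
  i=1
  while [ "$i" -le "$rounds" ]; do
    # HPA: choose enough replicas that per-Pod usage is at most target% of
    # the request, i.e. ceil(load * 100 / (request * target)).
    replicas=$(( (load * 100 + request * target - 1) / (request * target) ))
    # Load spreads evenly, so each Pod uses about load / replicas.
    usage=$(( load / replicas ))
    # VPA: usage sits well below the request, so the request is lowered to
    # the observed usage plus 15% headroom.
    request=$(( usage * 115 / 100 ))
    echo "round=$i replicas=$replicas usage=${usage}m request=${request}m"
    i=$(( i + 1 ))
  done
}

simulate_hpa_vpa 4
```

Under these assumptions, four rounds take the CPU request from 200m down to 19m while the replica count grows from 10 to 58. Real clusters have floors, replica limits, and stabilization windows that dampen this, but the direction of the drift is the same.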
Users have since accepted compromises in one of two ways:
- Using HPA to scale on CPU or memory and using VPA only for recommendations, building their own automation to review and actually apply the recommendations
- Using VPA to automatically apply changes to CPU and memory, while using HPA based on custom or external metrics
While these workarounds are suitable for a handful of use cases, there are still workloads that would benefit from autoscaling across the dimensions of both CPU and memory.
For example, a web application may need horizontal autoscaling on CPU when it is CPU bound, but may also benefit from vertical autoscaling on memory for reliability, in case a misconfigured memory request results in OOMKilled events for the container.
Multidimensional Pod Autoscaler
The first feature available in MPA allows users to scale Pods horizontally based on CPU utilization and vertically based on memory. It is available in GKE cluster versions 1.19.4-gke.1700 or newer.
In the MPA schema, there are two critical constructs that enable users to configure their desired behavior: goals and constraints. See the below manifest for an MPA resource, which has been shortened for readability:
- type: Resource
containerControlledResources: [ memory ]
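Because the manifest excerpt above is abridged, a fuller version may help with orientation. The sketch below writes what such a resource could look like to a local file. The `apiVersion`, `kind`, and field layout follow the GKE `MultidimPodAutoscaler` schema as we understand it and should be checked against the GKE documentation. The 60% CPU target, the `php-apache` Deployment, and the `php-apache-mpa` name come from later in this post; the replica bounds, the memory bounds, and the file name `mpa-sketch.yaml` are assumptions.

```shell
# Write a hypothetical, fuller MPA manifest to a local file for inspection.
# Replica and memory bounds are placeholder assumptions, not values from the post.
cat > mpa-sketch.yaml <<'EOF'
apiVersion: autoscaling.gke.io/v1beta1
kind: MultidimPodAutoscaler
metadata:
  name: php-apache-mpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  goals:
    metrics:
    - type: Resource
      resource:
        name: cpu                    # horizontal scaling signal
        target:
          type: Utilization
          averageUtilization: 60     # 60% of the CPU request
  constraints:
    global:
      minReplicas: 1                 # assumed
      maxReplicas: 10                # assumed
    containerControlledResources: [ memory ]   # MPA manages memory requests
    container:
    - name: '*'
      requests:
        minAllowed:
          memory: 50Mi               # assumed lower bound
        maxAllowed:
          memory: 1Gi                # assumed upper bound
  policy:
    updateMode: Auto
EOF
```

The file could then be applied with `kubectl apply -f mpa-sketch.yaml` on a cluster that supports MPA.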
Goals allow users to define targets for metrics. The first supported metric is target CPU utilization, defined much as it is in an HPA resource. The MPA attempts to meet these goals by distributing load across additional replicas of a given Pod.
Constraints, on the other hand, are a bit more stringent. They take precedence over goals, and can be applied either to global targets (think min and max replicas of a given Pod) or to specific resources. In the case of vertical autoscaling, this is where users (a) specify that memory is controlled by MPA and (b) optionally define the upper and lower boundaries for a Pod's memory requests.
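How a goal and a constraint interact can be sketched with the replica formula that the Kubernetes documentation gives for HPA: desired = ceil(current * currentUtilization / targetUtilization). Whether MPA uses exactly this formula is an assumption; the point is only that the goal proposes a replica count and the global constraint then bounds it. The function name `desired_replicas` and the sample numbers are hypothetical.

```shell
# desired_replicas CURRENT CURRENT_UTIL TARGET_UTIL MIN MAX
# Utilization values are whole percentages of the CPU request.
desired_replicas() {
  local current=$1 util=$2 target=$3 min=$4 max=$5
  local desired
  # Goal: ceil(current * util / target), using integer arithmetic.
  desired=$(( (current * util + target - 1) / target ))
  # Constraint: min/max replicas take precedence over the goal.
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

desired_replicas 5 90 60 1 10    # goal asks for 8, which is within bounds
desired_replicas 5 600 60 1 10   # goal asks for 50, capped at the max of 10
```

The second call shows the precedence rule: a heavily overloaded workload would ask for far more replicas than the constraint permits, and the constraint wins.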
Let's test this out!
We'll use Cloud Shell as our workstation and create a GKE cluster with a version that supports MPA:
$ gcloud beta container clusters create "mpa-sandbox" \
--cluster-version "1.20" \
--zone "us-west1-a" \
NAME         LOCATION    MASTER_VERSION   MASTER_IP       MACHINE_TYPE  NODE_VERSION     NUM_NODES  STATUS
mpa-sandbox  us-west1-a  1.20.7-gke.1800  22.214.171.124  e2-medium     1.20.7-gke.1800  3          RUNNING
We'll use the standard php-apache example Pods from the Kubernetes documentation on HPA. These manifests will create three Kubernetes objects - a Deployment, a Service, and a Multidimensional Pod Autoscaler.
$ kubectl apply -f https://raw.githubusercontent.com/agmsb/gke-mpa/main/php-apache-mpa.yaml
The Deployment consists of a php-apache Pod, is exposed via a Service of `type: LoadBalancer`, and is managed by a Multidimensional Pod Autoscaler (MPA).
The Pod template in the Deployment is configured to request 100 millicores of CPU and 50 mebibytes of memory. The MPA is configured to aim for 60% CPU utilization and to adjust Pod memory requests based on usage.
Once we have the resources deployed, grab the External IP address for the php-apache Service. Your External IP address will likely differ from the example below.
$ kubectl get svc -w | grep php-apache
php-apache LoadBalancer 10.115.250.207 126.96.36.199 80:30045/TCP 43s
We will then use the `hey` utility to send artificial traffic to our php-apache Pods and thus trigger action from the MPA. We will be sending traffic to the Pods via the Load Balancer's external IP address.
$ export VIP=188.8.131.52 && hey -z 1000s -c 1000 http://$VIP
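Copying the address by hand works, but it can also be read straight from the Service. The helper below is a sketch: `service_external_ip` is a hypothetical name, and it assumes the load balancer reports an IP (not a hostname) under `.status.loadBalancer.ingress[0].ip`.

```shell
# Print the external IP of a LoadBalancer Service (empty while still pending).
service_external_ip() {
  kubectl get svc "$1" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Example, assuming the php-apache Service from this walkthrough:
#   export VIP=$(service_external_ip php-apache)
#   hey -z 1000s -c 1000 "http://$VIP"
```

In the `hey` invocation, `-z 1000s` sets how long to keep sending requests and `-c 1000` sets the number of concurrent workers.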
The MPA will then scale the Deployment horizontally, adding Pod replicas to handle the incoming traffic.
Run the below command in a separate terminal to watch the example app scale up.
$ kubectl get pods -w
NAME                         READY  STATUS   RESTARTS  AGE
php-apache-865f785f6f-4kb74  1/1    Running  0         3m37s
php-apache-865f785f6f-7669f  1/1    Running  0         42s
php-apache-865f785f6f-78qh4  1/1    Running  0         10s
php-apache-865f785f6f-f79vf  1/1    Running  0         3m37s
php-apache-865f785f6f-g5th7  1/1    Running  0         3m37s
php-apache-865f785f6f-hcxqj  1/1    Running  0         3m37s
php-apache-865f785f6f-hzxf9  1/1    Running  0         42s
php-apache-865f785f6f-jt4kf  1/1    Running  0         10s
php-apache-865f785f6f-wc6mb  1/1    Running  0         3m37s
php-apache-865f785f6f-xm5wd  1/1    Running  0         10s
We can also observe the amount of CPU and memory each Pod replica is using:
$ kubectl top pods --use-protocol-buffers
NAME                         CPU(cores)  MEMORY(bytes)
php-apache-865f785f6f-4kb74  641m        146Mi
php-apache-865f785f6f-7669f  466m        124Mi
php-apache-865f785f6f-78qh4  653m        99Mi
php-apache-865f785f6f-f79vf  638m        146Mi
php-apache-865f785f6f-g5th7  623m        146Mi
php-apache-865f785f6f-hcxqj  466m        146Mi
php-apache-865f785f6f-hzxf9  624m        135Mi
php-apache-865f785f6f-jt4kf  473m        118Mi
php-apache-865f785f6f-wc6mb  471m        144Mi
php-apache-865f785f6f-xm5wd  625m        48Mi
In the output from the previous command, most Pods are using well over the 50Mi memory request that we specified in the Deployment. Digging into the MPA object, we can see that the MPA notices this as well and recommends an increase in memory requests.
$ kubectl describe mpa
Recommended Pod Resources:
Container Name: php-apache
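The comparison of memory usage against the 50Mi request can also be done mechanically. The sketch below filters `kubectl top pods` style output and prints the Pods whose memory usage exceeds a given request. The function name `pods_over_request` is hypothetical, and it assumes the three-column layout and `Mi` units shown in the earlier output.

```shell
# pods_over_request REQUEST_MI
# Reads `kubectl top pods` output on stdin and prints the Pods using more
# memory than REQUEST_MI mebibytes. Assumes memory is reported in Mi.
pods_over_request() {
  awk -v req="$1" '
    NR > 1 && $3 ~ /Mi$/ {
      used = $3
      sub(/Mi$/, "", used)          # strip the unit, leaving a number
      if (used + 0 > req + 0) print $1, $3
    }'
}

# Example with two rows taken from the earlier `kubectl top pods` output:
printf '%s\n' \
  'NAME CPU(cores) MEMORY(bytes)' \
  'php-apache-865f785f6f-4kb74 641m 146Mi' \
  'php-apache-865f785f6f-xm5wd 625m 48Mi' |
  pods_over_request 50
```

Only the first Pod is printed, since 48Mi is still under the request. Against a live cluster the same filter can be fed directly: `kubectl top pods --use-protocol-buffers | pods_over_request 50`.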
Eventually, we should see MPA actuate these recommendations and scale the Pods vertically.
We will know this is complete when any of the Pod replicas carries an annotation denoting that the MPA took action, and its memory request has been adjusted to reflect that action.
$ kubectl describe pod <pod-name>
Annotations: vpaObservedContainers: php-apache
vpaUpdates: Pod resources updated by php-apache-mpa: container 0: memory request
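To check the result across all replicas at once rather than describing Pods one by one, the requests can be listed with kubectl's `custom-columns` output. This is a sketch: `pod_memory_requests` is a hypothetical helper, and it assumes each Pod has a single container, as in the php-apache example.

```shell
# List each Pod alongside the memory request of its first container.
pod_memory_requests() {
  kubectl get pods \
    -o custom-columns='NAME:.metadata.name,MEM_REQUEST:.spec.containers[0].resources.requests.memory'
}

# Example:
#   pod_memory_requests
```

Replicas that the MPA has already updated should show a memory request different from the original 50Mi.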
Multidimensional Pod Autoscaler solves a challenge that many GKE users have faced, exposing a new method to control horizontal and vertical autoscaling via a single resource. Try it in GKE versions 1.19.4-gke.1700+, and stay tuned for additional functionality in MPA!
A special thanks to Mark Mirchandani, Jerzy Foryciarz, Kaslin Fields, Marcin Wielgus, and Tomek Weksej for their contributions to this blog post.