Developers & Practitioners

Scaling workloads across multiple dimensions in GKE

#containers

In Google Kubernetes Engine (GKE), application owners can define multiple autoscaling behaviors for a workload using a single Kubernetes resource: the Multidimensional Pod Autoscaler (MPA).

The challenges of scaling Pods horizontally and vertically

The success of Kubernetes as a widely adopted platform is grounded in its support for a variety of workloads and their many requirements. One of the areas that has continuously improved over time is workload autoscaling. 

Dating back to the early days of Kubernetes, the Horizontal Pod Autoscaler (HPA) was the primary mechanism for autoscaling Pods. As its name suggests, it gives users the ability to have Pod replicas added when a user-defined threshold for a given metric is crossed. Early on this was typically CPU or memory usage, though there is now support for custom and external metrics.

A bit further down the line, the Vertical Pod Autoscaler (VPA) added a new dimension to workload autoscaling. Much like its name suggests, VPA recommends the amount of CPU or memory that Pods should request based on usage patterns. Users can then either review those recommendations and decide whether to apply them, or entrust VPA to apply the changes automatically on their behalf.

Naturally, Kubernetes users have sought to get the benefits from both of these forms of scaling.

While these autoscalers work well independently of one another, running both at the same time can produce unexpected results.

Picture an example where HPA adjusts the number of replicas for a Pod to maintain a target of 50% CPU utilization. VPA, when configured to automatically apply recommendations, could fall into a loop of continuously shrinking CPU requests, a direct result of HPA maintaining its relatively low target for CPU utilization!


Part of the challenge here is that when configured to act autonomously, VPA applies changes to both CPU and memory. Thus, the contention is difficult to avoid as long as VPA is automatically applying changes.

Users have since accepted compromises in one of two ways: 

  • Using HPA to scale on CPU or memory and using VPA only for recommendations, building their own automation to review and actually apply the recommendations
  • Using VPA to automatically apply changes to CPU and memory, while using HPA based on custom or external metrics
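The first workaround hinges on VPA's recommendation-only mode. A minimal sketch, where the Deployment name `my-app` is illustrative:

```yaml
# Recommendation-only VPA: surfaces suggested requests without
# evicting or resizing Pods, leaving actuation to your own tooling.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app        # illustrative target workload
  updatePolicy:
    updateMode: "Off"   # recommend only; never apply changes
```

Recommendations then appear in the object's status and can be reviewed with `kubectl describe vpa my-app-vpa`.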

While these workarounds are suitable for a handful of use cases, there are still workloads that would benefit from autoscaling across the dimensions of both CPU and memory. 

For example, a web application may need horizontal autoscaling on CPU when it is CPU bound, but may also benefit from vertical autoscaling on memory for reliability, in case under-provisioned memory requests result in OOMKilled events for the container.

Multidimensional Pod Autoscaler 

The first feature available in MPA lets users scale Pods horizontally based on CPU utilization and vertically based on memory. It is available in GKE clusters running version 1.19.4-gke.1700 or newer.

In the MPA schema, there are two critical constructs that enable users to configure their desired behavior: goals and constraints. See the below manifest for an MPA resource, which has been shortened for readability:

  # mpa-example.yaml

...  
  goals:
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  constraints:
    global:
      minReplicas: 1
      maxReplicas: 10
    containerControlledResources: [ memory ]
  policy:
    updateMode: Auto
...

Goals allow users to define targets for metrics. The first supported metric is target CPU utilization, similar to how users define target CPU utilization in an HPA resource. The MPA attempts to meet these goals by distributing load across additional replicas of a given Pod.

Constraints, on the other hand, are more stringent. They take precedence over goals and can apply either to global targets (think minimum and maximum replicas of a given Pod) or to specific resources. In the case of vertical autoscaling, this is where users (a) specify that memory is controlled by MPA and (b) define the upper and lower boundaries for a Pod's memory requests, should they need to.
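For instance, the constraints block from the earlier manifest could be extended with per-container memory bounds. The `minAllowed`/`maxAllowed` fields follow VPA's resource-policy convention, and the specific values here are illustrative:

```yaml
constraints:
  global:
    minReplicas: 1
    maxReplicas: 10
  containerControlledResources: [ memory ]
  container:
  - name: '*'            # apply to every container in the Pod
    requests:
      minAllowed:
        memory: 64Mi     # never shrink memory requests below this
      maxAllowed:
        memory: 512Mi    # never grow memory requests above this
```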

Let's test this out! 

We'll use Cloud Shell as our workstation and create a GKE cluster with a version that supports MPA:

  $ gcloud beta container clusters create "mpa-sandbox" \
  --cluster-version "1.20" \
  --zone "us-west1-a" \
  --enable-vertical-pod-autoscaling

NAME         LOCATION    MASTER_VERSION   MASTER_IP       MACHINE_TYPE  NODE_VERSION     NUM_NODES  STATUS
mpa-sandbox  us-west1-a  1.20.7-gke.1800  35.203.191.106  e2-medium     1.20.7-gke.1800  3          RUNNING

We'll use the standard php-apache example Pods from the Kubernetes documentation on HPA. These manifests will create three Kubernetes objects - a Deployment, a Service, and a Multidimensional Pod Autoscaler.

  $ kubectl apply -f https://raw.githubusercontent.com/agmsb/gke-mpa/main/php-apache-mpa.yaml

deployment.apps/php-apache created
service/php-apache created
multidimpodautoscaler.autoscaling.gke.io/php-apache-mpa created

The Deployment consists of a php-apache Pod, is exposed via a Service of `type: LoadBalancer`, and is managed by a Multidimensional Pod Autoscaler (MPA).

The Pod template in the Deployment requests 100 millicores of CPU and 50 mebibytes of memory. The MPA is configured to target 60% CPU utilization and to adjust Pod memory requests based on usage.
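Paraphrasing the applied manifest, the relevant resource requests in the Pod template look roughly like this (shortened for readability):

```yaml
# The values MPA scales against: CPU drives horizontal scaling,
# while the memory request is managed vertically by the MPA.
resources:
  requests:
    cpu: 100m     # MPA adds replicas to hold utilization near 60%
    memory: 50Mi  # MPA adjusts this request based on observed usage
```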

Once we have the resources deployed, grab the external IP address for the php-apache Service. Your external IP address will differ from the example below.

  $ kubectl get svc -w | grep php-apache

php-apache   LoadBalancer   10.115.250.207   35.247.69.134   80:30045/TCP   43s

We will then use the `hey` load testing utility to send artificial traffic to our php-apache Pods (1,000 concurrent workers for 1,000 seconds) and thus trigger action from the MPA. We will send traffic to the Pods via the load balancer's external IP address.

  $ export VIP=35.247.69.134 && hey -z 1000s -c 1000 http://$VIP

The MPA will then scale the Deployment horizontally, adding Pod replicas to handle the incoming traffic. 

Run the below command in a separate terminal to watch the example app scale up.

  $ kubectl get pods -w

NAME                          READY   STATUS    RESTARTS   AGE
php-apache-865f785f6f-4kb74   1/1     Running   0          3m37s
php-apache-865f785f6f-7669f   1/1     Running   0          42s
php-apache-865f785f6f-78qh4   1/1     Running   0          10s
php-apache-865f785f6f-f79vf   1/1     Running   0          3m37s
php-apache-865f785f6f-g5th7   1/1     Running   0          3m37s
php-apache-865f785f6f-hcxqj   1/1     Running   0          3m37s
php-apache-865f785f6f-hzxf9   1/1     Running   0          42s
php-apache-865f785f6f-jt4kf   1/1     Running   0          10s
php-apache-865f785f6f-wc6mb   1/1     Running   0          3m37s
php-apache-865f785f6f-xm5wd   1/1     Running   0          10s

We can also observe the amount of CPU and memory each Pod replica is using:

  $ kubectl top pods --use-protocol-buffers

NAME                          CPU(cores)   MEMORY(bytes)   
php-apache-865f785f6f-4kb74   641m         146Mi           
php-apache-865f785f6f-7669f   466m         124Mi           
php-apache-865f785f6f-78qh4   653m         99Mi            
php-apache-865f785f6f-f79vf   638m         146Mi           
php-apache-865f785f6f-g5th7   623m         146Mi           
php-apache-865f785f6f-hcxqj   466m         146Mi           
php-apache-865f785f6f-hzxf9   624m         135Mi           
php-apache-865f785f6f-jt4kf   473m         118Mi           
php-apache-865f785f6f-wc6mb   471m         144Mi           
php-apache-865f785f6f-xm5wd   625m         48Mi

In the output of the previous command, the Pods are using well over the 50Mi memory request we specified in the Deployment. Digging into the MPA object, we can see that the MPA notices this as well, recommending an increase in memory requests.

  $ kubectl describe mpa
Recommended Pod Resources:
    Container Recommendations:
      Container Name:  php-apache
      Lower Bound:
        Memory:  78643200
      Target:
        Memory:  179306496
      Uncapped Target:
        Memory:  179306496
      Upper Bound:
        Memory:  81285611520
...

Eventually, we should see the MPA act on these recommendations and scale the Pods vertically (the Target above, 179306496 bytes, is exactly 171Mi).

We will know this is complete when a Pod replica carries an annotation denoting that action was taken by the MPA, and its memory request has been adjusted to reflect that action.

  $ kubectl describe pod <pod-name>
...
Annotations:  vpaObservedContainers: php-apache
              vpaUpdates: Pod resources updated by php-apache-mpa: container 0: memory request
...
   Requests:
      cpu:        100m
      memory:     171966464
...

Conclusion

Multidimensional Pod Autoscaler solves a challenge that many GKE users have faced, exposing a new method to control horizontal and vertical autoscaling via a single resource. Try it in GKE versions 1.19.4-gke.1700+, and stay tuned for additional functionality in MPA!

A special thanks to Mark Mirchandani, Jerzy Foryciarz, Kaslin Fields, Marcin Wielgus, and Tomek Weksej for their contributions to this blog post.