This tutorial shows how to scale your GKE workloads down to zero Pods by using KEDA. Scaling deployments to zero Pods saves resources during periods of inactivity (such as weekends and non-office hours) and for intermittent workloads such as periodic jobs.
Objectives
This tutorial describes the following use cases:
- Scale your Pub/Sub workload to zero: Scale the number of Pods in proportion to the number of messages queued on the Pub/Sub topic. When the queue is empty, the workload automatically scales down to zero Pods.
- Scale your LLM workload to zero: Deploy your LLM model servers on nodes with GPUs. When the service is idle, the workload automatically scales down to zero Pods.
Costs
In this document, you use the following billable components of Google Cloud:
- GKE
- GPU resources used by GKE
- Pub/Sub
To generate a cost estimate based on your projected usage,
use the pricing calculator.
When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.
Before you begin
In this tutorial, you use Cloud Shell to run commands. Cloud Shell is a shell environment for managing resources hosted on Google Cloud. It comes preinstalled with the Google Cloud CLI, kubectl, Helm and Terraform command-line tools. If you don't use Cloud Shell, you must install the Google Cloud CLI and Helm.
- To run the commands on this page, set up the gcloud CLI in one of the following development environments:
Cloud Shell
To use an online terminal with the gcloud CLI already set up, activate Cloud Shell in the Google Cloud console. A Cloud Shell session starts and displays a command-line prompt. It can take a few seconds for the session to initialize.
Local shell
To use a local development environment, follow these steps:
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Resource Manager, Compute Engine, GKE, and Pub/Sub APIs.
Set up your environment
To set up your environment with Cloud Shell, follow these steps:
Set environment variables:
export PROJECT_ID=PROJECT_ID
export PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format 'get(projectNumber)')
export LOCATION=LOCATION
Replace PROJECT_ID with your Google Cloud project ID and LOCATION with the region or zone where your GKE cluster should be created.

If you don't follow the entire tutorial in a single session, or if your environment variables become unset for some reason, run this command again to reset the variables.
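To confirm that the variables are set before continuing, you can echo them:

echo "Project: ${PROJECT_ID} (number: ${PROJECT_NUMBER}), location: ${LOCATION}"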
Create a Standard GKE cluster with cluster autoscaling and Workload Identity Federation for GKE enabled:
gcloud container clusters create scale-to-zero \
    --project=${PROJECT_ID} --location=${LOCATION} \
    --machine-type=n1-standard-2 \
    --enable-autoscaling --min-nodes=1 --max-nodes=5 \
    --workload-pool=${PROJECT_ID}.svc.id.goog
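Optionally, verify that Workload Identity Federation for GKE is enabled on the new cluster; the command should print ${PROJECT_ID}.svc.id.goog:

gcloud container clusters describe scale-to-zero --location=${LOCATION} \
    --format='value(workloadIdentityConfig.workloadPool)'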
Install KEDA
KEDA is a component that complements the Kubernetes Horizontal Pod Autoscaler. With KEDA, you can scale a Deployment down to zero Pods, and up from zero Pods to one Pod. The standard Horizontal Pod Autoscaler algorithm applies after GKE creates at least one Pod.
After GKE scales the Deployment to zero Pods, autoscaling can no longer rely on Pod metrics such as CPU utilization, because no Pods are running. Instead, KEDA fetches metrics that originate outside the cluster through an implementation of the Kubernetes External Metrics API. You can use this API to autoscale based on metrics such as the number of outstanding messages on a Pub/Sub subscription. See the KEDA documentation for a list of all supported metric sources.
Install KEDA on your cluster with Helm or with kubectl.
Helm
Run the following commands to add the KEDA Helm repository, install the KEDA Helm chart, and give the KEDA service account read access to Cloud Monitoring:
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --create-namespace --namespace keda
gcloud projects add-iam-policy-binding projects/${PROJECT_ID} \
--role roles/monitoring.viewer \
--member=principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/keda/sa/keda-operator
Note that this command also sets up authorization rules that require the cluster to be set up with Workload Identity Federation for GKE.
kubectl
Run the following commands to install KEDA using kubectl apply and to give the KEDA service account read access to Cloud Monitoring:
kubectl apply --server-side -f https://github.com/kedacore/keda/releases/download/v2.15.1/keda-2.15.1.yaml
gcloud projects add-iam-policy-binding projects/${PROJECT_ID} \
--role roles/monitoring.viewer \
--member=principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/keda/sa/keda-operator
Note that this command also sets up authorization rules that require the cluster to be set up with Workload Identity Federation for GKE.
Confirm that all KEDA resources appear under the keda namespace:
kubectl get all -n keda
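You can also confirm that the keda-operator Kubernetes service account, the subject of the IAM binding you created earlier, exists:

kubectl get serviceaccount keda-operator -n keda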
For more information about KEDA design and resources, see the KEDA documentation.
Scale your Pub/Sub workload to zero
This section describes a workload that processes messages from a Pub/Sub subscription, handling each message and acknowledging its completion. The workload scales dynamically: as the number of unacknowledged messages increases, autoscaling instantiates more Pods to ensure timely processing.
Scaling to zero ensures that no Pods are instantiated when no messages have been received for a while. This saves resources as no Pods stay idle for long periods of time.
Deploy a Pub/Sub workload
Deploy a sample workload that processes messages queued on a Pub/Sub topic. To simulate a realistic workload, this sample program waits three seconds before acknowledging a message. The workload is configured to run under the keda-pubsub-sa service account.
Run the following commands to create the Pub/Sub topic and subscription, configure their permissions, and create the Deployment that starts the workload in the keda-pubsub namespace.
gcloud pubsub topics create keda-echo
gcloud pubsub subscriptions create keda-echo-read --topic=keda-echo
gcloud projects add-iam-policy-binding projects/${PROJECT_ID} \
--role=roles/pubsub.subscriber \
--member=principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/keda-pubsub/sa/keda-pubsub-sa
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/main/cost-optimization/gke-keda/cloud-pubsub/deployment/keda-pubsub-with-workload-identity.yaml
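To check that the manifest applied cleanly, list the resources in the keda-pubsub namespace; the keda-pubsub Deployment and the keda-pubsub-sa service account come from the manifest above:

kubectl get deployment,serviceaccount -n keda-pubsub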
Configure scale-to-zero
To configure your Pub/Sub workload to scale to zero, use KEDA to define a ScaledObject resource that specifies how the deployment should scale. KEDA then automatically creates and manages the underlying HorizontalPodAutoscaler (HPA) object.
Create the ScaledObject resource to describe the expected autoscaling behavior:

curl -sSL https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/main/cost-optimization/gke-keda/cloud-pubsub/deployment/keda-pubsub-scaledobject.yaml | envsubst | kubectl apply -f -
This creates the following object:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: keda-pubsub
  namespace: keda-pubsub
spec:
  maxReplicaCount: 5
  scaleTargetRef:
    name: keda-pubsub
  triggers:
    - type: gcp-pubsub
      authenticationRef:
        name: keda-auth
      metadata:
        subscriptionName: "projects/${PROJECT_ID}/subscriptions/keda-echo-read"
Inspect the HorizontalPodAutoscaler (HPA) object that KEDA creates based on the ScaledObject object:

kubectl get hpa keda-hpa-keda-pubsub -n keda-pubsub -o yaml
You can read more about autoscaling in the Kubernetes documentation.
Wait until KEDA acknowledges that the Pub/Sub subscription is empty, and scales the Deployment to zero replicas.
Inspect the workload autoscaler:
kubectl describe hpa keda-hpa-keda-pubsub -n keda-pubsub
Observe that in the command response, the ScalingActive condition is false. The associated message shows that the Horizontal Pod Autoscaler acknowledges that KEDA scaled the deployment to zero, at which point it stops operating until the Deployment scales back up to one Pod.

Name:             keda-hpa-keda-pubsub
Namespace:        keda-pubsub
Metrics:          ( current / target )
  "s0-gcp-ps-projects-[...]" (target average value): 0 / 10
Min replicas:     1
Max replicas:     5
Deployment pods:  5 current / 5 desired
Conditions:
  Type            Status  Reason               Message
  ----            ------  ------               -------
  AbleToScale     True    ScaleDownStabilized  recent recommendations were higher than current one [...]
  ScalingActive   False   ScalingDisabled      scaling is disabled since the replica count of the target is zero
  ScalingLimited  True    TooManyReplicas      the desired replica count is more than the maximum replica count
Trigger the scale-up
To trigger the Deployment to scale up:
Enqueue messages on the Pub/Sub topic:
for num in {1..20}
do
  gcloud pubsub topics publish keda-echo --project=${PROJECT_ID} --message="Test"
done
Verify that the Deployment is scaling up:
kubectl get deployments -n keda-pubsub
In the output, observe that the READY column shows one replica:

NAME          READY   UP-TO-DATE   AVAILABLE   AGE
keda-pubsub   1/1     1            1           2d
KEDA scales up the Deployment after it observes that the queue is not empty.
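After the workload drains the queue, KEDA waits out a cooldown period (300 seconds unless overridden in the ScaledObject) and then scales the Deployment back down to zero. You can observe the scale-down by watching the Deployment:

kubectl get deployment keda-pubsub -n keda-pubsub --watch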
Scale your LLM workload to zero
This section describes a Large Language Model (LLM) workload that deploys an Ollama server with an attached GPU. Ollama allows running popular LLMs such as Gemma and Llama 2, and exposes its features primarily through HTTP.
Install the KEDA-HTTP add-on
Scaling an HTTP service down to zero Pods during periods of inactivity causes request failures, since there's no backend to handle the requests.
This section shows how to solve this problem using the KEDA-HTTP add-on. KEDA-HTTP starts an HTTP proxy that receives user requests and forwards them to the Services configured to scale to zero. When the Service has no Pods, the proxy triggers the Service to scale up, and it buffers the request until the Service has scaled up to at least one Pod.
Install the KEDA-HTTP add-on using Helm. For more information, refer to the KEDA-HTTP documentation.

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
# Set the proxy timeout to 120s, giving Ollama time to start.
helm install http-add-on kedacore/keda-add-ons-http \
--create-namespace --namespace keda \
--set interceptor.responseHeaderTimeout=120s
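To confirm the add-on is running, list the Deployments in the keda namespace; with the release name http-add-on used above, you should see the add-on's interceptor, external scaler, and operator components alongside the KEDA operator itself:

kubectl get deployments -n keda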
Deploy an Ollama LLM workload
To deploy an Ollama LLM workload:
Create a node pool containing g2-standard-4 nodes with attached GPUs, and configure cluster autoscaling to provide between zero and two nodes:

gcloud container node-pools create gpu --machine-type=g2-standard-4 \
    --location=${LOCATION} --cluster=scale-to-zero \
    --min-nodes 0 --max-nodes 2 --num-nodes=1 --enable-autoscaling
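Optionally, confirm that a GPU node joined the cluster. GKE labels GPU nodes with cloud.google.com/gke-accelerator, so you can filter on that label:

kubectl get nodes -l cloud.google.com/gke-accelerator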
Add the official Ollama Helm chart repository, and update your local Helm client's repository:
helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm repo update
Deploy the Ollama server using the Helm chart:
helm install ollama ollama-helm/ollama --create-namespace --namespace ollama \
    -f https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/main/cost-optimization/gke-keda/ollama/helm-values-ollama.yaml
The helm-values-ollama.yaml configuration specifies the LLM models to load, the GPU requirements, and the TCP port for the Ollama server.
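Pulling the model weights for the first time can take several minutes. You can watch the Ollama Pod until it reports Ready:

kubectl get pods -n ollama --watch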
Configure scale-to-zero
To configure your Ollama workload to scale to zero, KEDA-HTTP uses an HTTPScaledObject resource.
Create the HTTPScaledObject resource to describe the expected autoscaling behavior:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/main/cost-optimization/gke-keda/ollama/keda-ollama-httpscaledobject.yaml
This creates the HTTPScaledObject object that defines the following fields:
- scaleTargetRef: specifies the Service to which KEDA-HTTP should forward the requests. In this example, all requests with the host ollama.ollama are routed to the Ollama server.
- scaledownPeriod: specifies (in seconds) how fast to scale down when no requests are received.
- replicas: specifies the minimum and maximum number of Pods to maintain for the Ollama deployment.
- scalingMetric: specifies the metrics used to drive autoscaling, such as request rate in this example. For more metric options, see the KEDA-HTTP documentation.
kind: HTTPScaledObject
apiVersion: http.keda.sh/v1alpha1
metadata:
  namespace: ollama
  name: ollama
spec:
  hosts:
  - ollama.ollama
  scaleTargetRef:
    name: ollama
    kind: Deployment
    apiVersion: apps/v1
    service: ollama
    port: 11434
  replicas:
    min: 0
    max: 2
  scaledownPeriod: 3600
  scalingMetric:
    requestRate:
      targetValue: 20
Run the following command to verify that KEDA-HTTP successfully processed the HTTPScaledObject created in the previous step:

kubectl get hpa,scaledobject -n ollama
The output shows the HorizontalPodAutoscaler (created by KEDA) and the ScaledObject (created by KEDA-HTTP) resources:

NAME                                                  REFERENCE           TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/keda-hpa-ollama   Deployment/ollama   0/100 (avg)   1         2         1          2d

NAME                          SCALETARGETKIND      SCALETARGETNAME   MIN   MAX   TRIGGERS        AUTHENTICATION   READY   ACTIVE   FALLBACK   PAUSED    AGE
scaledobject.keda.sh/ollama   apps/v1.Deployment   ollama            0     2     external-push                    True    False    False      Unknown   2d
Verify that the Deployment scales down to zero Pods. Wait the period of time set in the scaledownPeriod field, and then run the command:

kubectl get deployments -n ollama
The output shows that KEDA scaled down the Ollama deployment and that no Pods are running:

NAME     READY   UP-TO-DATE   AVAILABLE   AGE
ollama   0/0     0            0           2d
Trigger the scale-up
To trigger the Deployment to scale up, call the Ollama service using the proxy set up by the KEDA-HTTP add-on. This causes the value of the request rate metric to increase, and triggers the creation of a first Pod.
Use the kubectl port-forwarding capability to access the proxy, because the proxy is not exposed externally.
kubectl port-forward svc/keda-add-ons-http-interceptor-proxy -n keda 8080:8080 &
# Set the 'Host' HTTP header so that the proxy routes requests to the Ollama server.
curl -H "Host: ollama.ollama" \
http://localhost:8080/api/generate \
-d '{ "model": "gemma:7b", "prompt": "Hello!" }'
The curl command sends the prompt "Hello!" to a Gemma model. Observe the answer tokens coming back in the response. For the specification of the API, see the Ollama guide.
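When you're done, stop the background kubectl port-forward job started earlier (this assumes it's the only background job in your shell):

kill %1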
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Clean up the Pub/Sub subscription and topic:
gcloud pubsub subscriptions delete keda-echo-read
gcloud pubsub topics delete keda-echo
Delete your GKE cluster:
gcloud container clusters delete scale-to-zero --location=${LOCATION}
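Alternatively, if you created a project specifically for this tutorial, you can delete the whole project instead of the individual resources:

gcloud projects delete ${PROJECT_ID}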
What's next
- Learn more about autoscaling LLM inference workloads in GKE.
- Explore the KEDA GitHub repository and documentation.