This set of tutorials is for IT administrators and Operators who want to deploy, run, and manage modern application environments that run on Google Kubernetes Engine (GKE). As you progress through this set of tutorials, you learn how to configure monitoring and alerts, scale workloads, and simulate failure, all using the Cymbal Bank sample microservices application:
- Create a cluster and deploy a sample application
- Monitor with Google Cloud Managed Service for Prometheus
- Scale workloads (this tutorial)
- Simulate a failure
Overview and objectives
A consumer application like Cymbal Bank often has varying numbers of users at different times. Ideally, your website can cope with surges in traffic without slowing down or having other issues, and without your organization having to pay for Cloud resources that it doesn't actually need. Google Cloud provides autoscaling as a solution for this.
In this tutorial, you learn how to configure clusters and workloads in a GKE cluster to scale using both built-in Kubernetes metrics and custom metrics from Cloud Monitoring and Cloud Trace. You learn how to complete the following tasks:
- Enable custom metrics in Cloud Monitoring for Trace.
- Custom metrics let you scale using additional monitoring data or external inputs beyond the awareness of the Kubernetes cluster, like network traffic or HTTP response codes.
- Configure the Horizontal Pod Autoscaler, a GKE feature that can automatically increase or decrease the number of Pods for a workload depending on specified metrics.
- Simulate application load and watch the cluster autoscaler and Horizontal Pod Autoscaler respond.
Costs
Enabling GKE and deploying the Cymbal Bank sample application for this series of tutorials means that you incur per-cluster charges for GKE on Google Cloud as listed on our Pricing page until you disable GKE or delete the project.
You are also responsible for other Google Cloud costs incurred while running the Cymbal Bank sample application, such as charges for Compute Engine VMs and Trace.
Before you begin
To learn how to scale your deployments, you must complete the first tutorial to create a GKE cluster that uses Autopilot and deploy the Cymbal Bank sample microservices-based application.
We recommend that you complete this set of tutorials for scalable apps in order. As you progress through the set of tutorials, you learn new skills and use additional Google Cloud products and services.
You also need to create an IAM service account and grant some permissions for the Horizontal Pod Autoscaler to work correctly:
Create an IAM service account. This service account is used in the tutorial to grant access to custom metrics that allow the Horizontal Pod Autoscaler to determine when to scale up or down:
gcloud iam service-accounts create scalable-apps
Grant access to the IAM service account to perform the required scaling actions:
gcloud projects add-iam-policy-binding PROJECT_ID \
  --role roles/cloudtrace.agent \
  --member "serviceAccount:scalable-apps@PROJECT_ID.iam.gserviceaccount.com"

gcloud projects add-iam-policy-binding PROJECT_ID \
  --role roles/monitoring.metricWriter \
  --member "serviceAccount:scalable-apps@PROJECT_ID.iam.gserviceaccount.com"

gcloud iam service-accounts add-iam-policy-binding "scalable-apps@PROJECT_ID.iam.gserviceaccount.com" \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT_ID.svc.id.goog[default/default]"

In these commands, replace PROJECT_ID with the ID of your Google Cloud project.
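Optionally, you can confirm that the project-level bindings were applied. This is an optional verification that isn't part of the original steps; it uses standard gcloud policy filtering:

gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:scalable-apps@PROJECT_ID.iam.gserviceaccount.com" \
  --format="table(bindings.role)"

The output should list `roles/cloudtrace.agent` and `roles/monitoring.metricWriter` for the service account.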
The following access is granted to the IAM service account:
- `roles/cloudtrace.agent`: Write trace data, such as latency information, to Trace.
- `roles/monitoring.metricWriter`: Write metrics to Cloud Monitoring.
- `roles/iam.workloadIdentityUser`: Allow a Kubernetes service account to use Workload Identity Federation for GKE to act as the IAM service account.
Configure the `default` Kubernetes service account in the `default` namespace to act as the IAM service account that you created:

kubectl annotate serviceaccount default \
  iam.gke.io/gcp-service-account=scalable-apps@PROJECT_ID.iam.gserviceaccount.com
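Optionally, verify that the annotation is present on the Kubernetes service account. This is an optional check; the dot-escaping in the jsonpath expression follows the same convention as the forwarding-rule lookup later in this tutorial:

kubectl get serviceaccount default \
  -o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}'

The output should be the email address of the `scalable-apps` IAM service account.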
This configuration allows Pods that use the `default` Kubernetes service account in the `default` namespace to access the same Google Cloud resources as the IAM service account.
Set up custom metrics collection
You can configure the Horizontal Pod Autoscaler to use basic built-in Kubernetes CPU and memory metrics, or you can use custom metrics from Cloud Monitoring, such as HTTP requests per second or the number of `SELECT` statements. Custom metrics can work without application changes, and give your cluster more insight into the overall performance and needs of the application. In this tutorial, you learn how to use both built-in and custom metrics.
To allow the Horizontal Pod Autoscaler to read custom metrics from Monitoring, you must install the Custom Metrics Stackdriver Adapter in your cluster.
Deploy the custom metrics Stackdriver adapter to your cluster:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter.yaml
To allow the Stackdriver adapter to get custom metrics from your cluster, you use Workload Identity Federation for GKE. This approach uses an IAM service account that has permissions to read monitoring metrics.
Grant the IAM service account the
roles/monitoring.viewer
role:gcloud projects add-iam-policy-binding PROJECT_ID \ --member "serviceAccount:scalable-apps@PROJECT_ID.iam.gserviceaccount.com" \ --role roles/monitoring.viewer
Configure the Stackdriver adapter to use Workload Identity Federation for GKE and the IAM service account that has permissions to read the monitoring metrics:
gcloud iam service-accounts add-iam-policy-binding scalable-apps@PROJECT_ID.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]"
Kubernetes includes its own system of service accounts for access within a cluster. To let your applications authenticate to services and resources outside of your Google Kubernetes Engine clusters, such as Monitoring, you use Workload Identity Federation for GKE. This approach configures the Kubernetes service account to act as the IAM service account.
Annotate the Kubernetes service account that the adapter uses:
kubectl annotate serviceaccount custom-metrics-stackdriver-adapter \
  --namespace=custom-metrics \
  iam.gke.io/gcp-service-account=scalable-apps@PROJECT_ID.iam.gserviceaccount.com
Restart the Stackdriver adapter Deployment to apply the changes:
kubectl rollout restart deployment custom-metrics-stackdriver-adapter \
  --namespace=custom-metrics
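Optionally, you can verify that the adapter is serving metrics before you continue. This is an optional check, and it assumes the adapter registers the external metrics API with the cluster (the production adapter manifest that you deployed does this):

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"

A successful JSON response from this API group indicates that the Horizontal Pod Autoscaler can query Monitoring metrics through the adapter.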
Configure the Horizontal Pod Autoscaler
GKE Autopilot can scale in a few different ways. In this tutorial, you see how your cluster can scale using the following methods:
- Horizontal Pod Autoscaler: scales the number of Pods for a workload.
- Cluster autoscaler: scales the node resources that are available in the cluster.
These two methods can work together so that as the number of Pods for your applications changes, the node resources to support those Pods also changes.
Other implementations are available to scale Pods that build on top of the Horizontal Pod Autoscaler, and you can also use the Vertical Pod Autoscaler to adjust a Pod's CPU and memory requests instead of the number of Pods.
In this tutorial, you configure the Horizontal Pod Autoscaler for the `userservice` Deployment using built-in metrics, and for the `frontend` Deployment using custom metrics.
For your own applications, work with your Application developers and Platform engineers to understand their needs and configure the Horizontal Pod Autoscaler rules.
Scale the `userservice` Deployment
When the number of users of the Cymbal Bank sample application increases, the `userservice` Service consumes more CPU resources. You use a `HorizontalPodAutoscaler` object to control how you want your application to respond to load. In the YAML manifest for the `HorizontalPodAutoscaler`, you define which Deployment you want the Horizontal Pod Autoscaler to scale, which metrics to monitor, and the minimum and maximum number of replicas that you want to run.
Review the `HorizontalPodAutoscaler` sample manifest for the `userservice` Deployment (a representative sketch follows this list). This manifest does the following:
- Sets the maximum number of replicas during a scale-up to `50`.
- Sets the minimum number of replicas during a scale-down to `5`.
- Uses a built-in Kubernetes metric to make scaling decisions. In this sample, the metric is CPU utilization, and the target utilization is 60%, which avoids both over- and under-utilization.
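For reference, the following is a minimal sketch of a `HorizontalPodAutoscaler` with this configuration. It uses the standard Kubernetes `autoscaling/v2` API and only the values described in the list; the sample file at `extras/postgres-hpa/hpa/userservice.yaml` may differ in details:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: userservice
spec:
  scaleTargetRef:
    # The Deployment that the Horizontal Pod Autoscaler scales
    apiVersion: apps/v1
    kind: Deployment
    name: userservice
  minReplicas: 5
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          # Add or remove Pods to keep average CPU utilization near 60%
          averageUtilization: 60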
Apply the manifest to the cluster:
kubectl apply -f extras/postgres-hpa/hpa/userservice.yaml
Scale the `frontend` Deployment
In the previous section, you configured the Horizontal Pod Autoscaler on the `userservice` Deployment based on built-in Kubernetes metrics for CPU utilization. For the `frontend` Deployment, you might want to instead scale based on the number of incoming HTTP requests. This approach uses the Stackdriver adapter to read custom metrics from Monitoring for the HTTP(S) Load Balancer Ingress object.
Review the `HorizontalPodAutoscaler` manifest for the `frontend` Deployment (a representative sketch follows this list). This manifest uses the following fields:
- `spec.scaleTargetRef`: The Kubernetes resource to scale.
- `spec.minReplicas`: The minimum number of replicas, which is `5` in this sample.
- `spec.maxReplicas`: The maximum number of replicas, which is `25` in this sample.
- `spec.metrics.*`: The metric to use. In this sample, this is the number of HTTP requests per second, which is a custom metric from Monitoring provided by the adapter that you deployed.
- `spec.metrics.external.metric.selector.matchLabels`: The specific resource label to filter on when scaling.
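For reference, the following is a minimal sketch of such a manifest. The `autoscaling/v2` structure is standard Kubernetes, and the replica counts and the target of 5 requests per second match the values shown later in this tutorial; the exact metric name is an assumption here (the HTTP(S) Load Balancer request count exposed by Cloud Monitoring), and the sample file at `extras/postgres-hpa/hpa/frontend.yaml` may differ in details:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 5
  maxReplicas: 25
  metrics:
    - type: External
      external:
        metric:
          # Assumed metric: request count from the HTTP(S) Load Balancer
          name: loadbalancing.googleapis.com|https|request_count
          selector:
            matchLabels:
              # Scope the metric to the frontend Ingress; the sed command
              # later in this section replaces this placeholder with your
              # forwarding rule name
              resource.labels.forwarding_rule_name: FORWARDING_RULE_NAME
        target:
          type: AverageValue
          # Scale to keep roughly 5 requests per second per replica
          averageValue: 5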
Find the name of the forwarding rule from the `frontend` Ingress load balancer:

export FW_RULE=$(kubectl get ingress frontend -o=jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/forwarding-rule}')
echo $FW_RULE
The output is similar to the following:
k8s2-fr-j76hrtv4-default-frontend-wvvf7381
Add your forwarding rule to the manifest:
sed -i "s/FORWARDING_RULE_NAME/$FW_RULE/g" "extras/postgres-hpa/hpa/frontend.yaml"
This command replaces `FORWARDING_RULE_NAME` with your saved forwarding rule.

Apply the manifest to the cluster:
kubectl apply -f extras/postgres-hpa/hpa/frontend.yaml
Simulate load
In this section, you use a load generator to simulate spikes in traffic and observe your replica count and node count scale up to accommodate the increased load over time. You can then stop generating traffic and observe the replica and node count scale down in response.
Before you start, check the status of the Horizontal Pod Autoscaler and look at the number of replicas in use.
Get the state of your `HorizontalPodAutoscaler` resources:

kubectl get hpa
The output is similar to the following, which shows that there is 1 `frontend` replica and 5 `userservice` replicas:

NAME          REFERENCE                 TARGETS             MINPODS   MAXPODS   REPLICAS   AGE
frontend      Deployment/frontend       <unknown>/5 (avg)   5         25        1          34s
userservice   Deployment/userservice    0%/60%              5         50        5          4m56s
The Cymbal Bank sample application includes a `loadgenerator` Service. This Service continuously sends requests imitating users to the frontend, and periodically creates new accounts and simulates transactions between them.

Expose the `loadgenerator` web interface locally. You use this interface to simulate load on the Cymbal Bank sample application:

kubectl port-forward svc/loadgenerator 8080
If you see an error message, try again when the Pod is running.
In a browser on your computer, open the load generator web interface:
- If you're using a local shell, open a browser and go to http://127.0.0.1:8080.
- If you're using Cloud Shell, click Web preview, and then click Preview on port 8080.
In the load generator web interface, if the Failures value shows `100%`, complete the following steps to update the test settings:
- Click the Stop button next to the failure rate counter.
- Under Status, click the option for New test.
- Update the Host value to the IP address of your Cymbal Bank ingress.
- Click Start swarming.
In the load generator web interface, click the Charts tab to observe performance over time. Look at the number of requests and resource utilization.
Open a new terminal window and watch the replica count of your `frontend` and `userservice` Pods:

kubectl get hpa -w
The number of replicas increases as the load increases. The scale-up might take approximately ten minutes while the cluster recognizes that the configured metrics have reached the defined threshold, and the Horizontal Pod Autoscaler increases the number of Pods.
The following example output shows the number of replicas has increased as the load generator runs:
NAME          REFERENCE                 TARGETS         MINPODS   MAXPODS   REPLICAS
frontend      Deployment/frontend       5200m/5 (avg)   5         25        13
userservice   Deployment/userservice    71%/60%         5         50        17
Open another terminal window and check the number of nodes in the cluster:
gcloud container clusters list \
  --filter='name=scalable-apps' \
  --format='table(name, currentMasterVersion, currentNodeVersion, currentNodeCount)' \
  --region="REGION"
Replace `REGION` with the region that your cluster runs in.

The number of nodes has also increased from the starting quantity to accommodate the new replicas. This increase in the number of nodes is powered by GKE Autopilot; there's nothing that you need to configure for this node scaling.
Open the load generator interface and click Stop to end the test.
Check the replica count and node count again and observe as the numbers reduce with the reduced load. The scale down might take some time, because the default stabilization window for replicas in the Kubernetes `HorizontalPodAutoscaler` resource is five minutes.
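If your own workloads need a different scale-down profile, you can tune this window through the `behavior` field, which is part of the standard `autoscaling/v2` API. The following is a minimal sketch, not part of this tutorial's sample manifests, and the `600` value is only illustrative:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 5
  maxReplicas: 25
  behavior:
    scaleDown:
      # Wait 10 minutes of consistently lower metrics before removing
      # replicas (the Kubernetes default is 300 seconds)
      stabilizationWindowSeconds: 600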
In a real environment, both the number of nodes and the number of Pods would automatically scale up and down in the same way as with this simulated load. The Cymbal Bank sample application is designed to accommodate this kind of scaling. Check with your App operators and site reliability engineers (SREs) or Application developers to see whether their workloads can benefit from these scaling features.
Clean up
The set of tutorials for Cymbal Bank is designed to be completed one after the other. As you progress through the set of tutorials, you learn new skills and use additional Google Cloud products and services.
If you want to take a break before you move on to the next tutorial and avoid incurring charges to your Google Cloud account for the resources used in this tutorial, delete the project you created.
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
Learn how to simulate a failure in GKE in the next tutorial.