Running web applications on GKE using cost-optimized PVMs

This tutorial shows you how to handle preemptions when you serve a web application from Google Kubernetes Engine (GKE) nodes that run on preemptible VMs (PVMs). PVMs are affordable, short-lived compute instances suitable for fault-tolerant workloads. They offer the same machine types and options as regular compute instances and last for up to 24 hours.

This tutorial is intended for application developers, system architects, and DevOps engineers who define, implement, and deploy web-facing applications and want to use PVMs in production deployments. The tutorial assumes you understand fundamental Kubernetes concepts and the load balancing components in HTTP(S) Load Balancing.

Background

A PVM is limited to a 24-hour runtime, and receives a 30-second warning of shutdown when the instance is about to be preempted. Compute Engine first sends a preemption notice to the instance in the form of an ACPI G2 Soft Off signal, which is comparable to SIGTERM. After 30 seconds, Compute Engine sends an ACPI G3 Mechanical Off signal, which is comparable to SIGKILL, to the instance operating system, and then transitions the instance to a TERMINATED state.

PVMs are a good choice for distributed, fault-tolerant workloads that don't require continuous availability of a single instance. Examples of this type of workload include video encoding, rendering for visual effects, data analytics, simulation, and genomics. However, because of availability limitations and potentially frequent interruptions resulting from preemptions, PVMs are generally not recommended for web- and user-facing applications.

This tutorial walks through setting up a deployment that uses a combination of PVMs and standard VMs on GKE to help reliably serve web application traffic without any disruption.

Challenges of using PVMs

The biggest challenge of using PVMs in serving user-facing traffic is to ensure that user requests are not disrupted. On preemption, you must address the following:

  • How do you ensure an application's availability when it's running on PVMs? PVMs don't have guaranteed availability and are explicitly excluded from Compute Engine service level agreements.
  • How do you handle graceful termination of the application such that the following are true:
    • The load balancer stops forwarding requests to the Pods that are running on an instance that's being preempted.
    • In-flight requests are gracefully handled; they either complete or are shut down.
    • Connections to your databases and the application are closed or drained before the instance is shut down.
  • How do you handle requests, such as business-critical transactions, that might require uptime guarantees or that are not fault tolerant?

Consider the challenges in gracefully shutting down containers that are running on PVMs. From an implementation standpoint, the easiest way to write cleanup logic when an instance is shutting down is through a shutdown script. However, shutdown scripts are not supported if you are running containerized workloads in GKE.

Alternatively, you can use a SIGTERM handler in your application to write cleanup logic. Containers also provide lifecycle hooks such as preStop, which is triggered just before the container is shut down. In Kubernetes, Kubelet is responsible for executing container lifecycle events. Within a Kubernetes cluster, Kubelet runs on the VM and watches for Pod specs through the Kubernetes API server.

When you evict a Pod by using the command line or the API, Kubelet sees that the Pod has been marked as terminating and begins the shutdown process. The duration of the process is limited by the grace period, which is a set number of seconds after which Kubelet sends a SIGKILL signal to the containers. As part of the graceful shutdown, if any container running in the Pod has defined a preStop hook, Kubelet runs the hook inside the container. Next, Kubelet sends a SIGTERM signal to process ID 1 inside each container. If the application has a SIGTERM handler, the handler is executed. If any processes are still running in the Pod when the grace period expires, Kubelet sends them a SIGKILL signal.
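
The application's side of this shutdown sequence can be illustrated with a minimal shell sketch. It assumes a hypothetical run_app function that stands in for a container's main process. The function name, the messages, and the one-second delays are illustrative and are not taken from the tutorial's sample application.

```shell
# Hypothetical stand-in for the main process of a container.
run_app() {
  # Cleanup logic runs when SIGTERM arrives, before the grace period ends.
  trap 'echo "SIGTERM received: draining connections"; exit 0' TERM
  echo "app started"
  while :; do
    # Sleep in the background and wait on it so that the trap fires promptly.
    sleep 1 &
    wait $!
  done
}

# Demonstration: start the app, send SIGTERM, and wait for it to exit.
log=$(mktemp)
run_app >"$log" &
app_pid=$!
sleep 1
kill -TERM "$app_pid"
wait "$app_pid"
app_status=$?
app_output=$(cat "$log")
rm -f "$log"
echo "$app_output"
echo "exit status: $app_status"
```

Because the handler exits on its own, the process is gone before a SIGKILL would be needed. A handler that takes longer than the grace period would be cut off.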

Imagine that a Kubernetes node is undergoing preemption. You need a mechanism to catch the preemption notice and start the process of evicting the Pod. Suppose that you are running a program that listens for a preemption notice and evicts running Pods upon receiving an event. On eviction, the Pod shutdown sequence described previously gets triggered. However, in this case, the node is also undergoing a shutdown, which is handled by the node's operating system (OS). This shutdown can potentially interfere with Kubelet's handling of the container lifecycle, which means that the container can be abruptly shut down even if it's in the middle of executing a preStop hook.

Furthermore, from the standpoint of availability and traffic management, running your web-facing application exclusively on PVMs can also pose various challenges. Before using PVMs, consider the following questions:

  • What happens if most of the PVMs are preempted at once? For applications serving thousands of requests per second, how do the requests fail over without disruption?
  • What happens if PVM capacity is not available? How do you scale out or maintain a steady state of deployment for your application in that case?

Architecture

To solve the challenges of using PVMs, you must do all of the following:

  • Run a GKE cluster with two node pools: one running PVMs and the other running standard VMs. This lets you split traffic and have an active failover to handle new and in-flight requests in the event of a preemption. This approach also lets you split traffic between standard VMs and PVMs based on your requirements for uptime guarantees and fault tolerance. For more information about deciding the size of node pools, see Considerations.
  • Listen for preemption notices on PVMs and evict Pods that are running on the node.
  • Use a preStop hook or SIGTERM handler to execute cleanup logic.
  • Ensure that Kubelet is allowed to handle the Pod termination lifecycle and is not abruptly shut down.
  • Taint the node so that no new Pods are scheduled on it while it is being preempted.

The following diagram shows a high-level view of the architecture that you deploy in this tutorial.

High-level architecture.

As the diagram shows, you create two node pools: default-pool runs on standard VMs and pvm-pool runs on PVMs. The default GKE scheduler tries to evenly distribute Pods across all the instances in node pools. For example, if you are deploying four replicas and have two nodes running in each node pool, the scheduler provisions one Pod each on all four nodes. However, when using PVMs, you might want to direct more traffic towards pvm-pool to get higher PVM utilization and therefore cost savings. For example, you might want to provision three Pods in pvm-pool and one Pod in default-pool for failover, thereby reducing the number of standard VMs in the cluster.

To achieve this kind of control over scheduling, you can either write your own scheduler or split the application into two deployments. Each deployment will be pinned to a node pool based on node affinity rules. In this example, you create two deployments, web-std and web-pvm, where web-std is pinned to default-pool and web-pvm is pinned to pvm-pool.

For pvm-pool nodes, you must listen for a preemption notice and, upon receiving it, start evicting Pods. You can write your own logic to listen for the preemption notice and evict Pods that are running on the node. Or, as in this tutorial, you can create an agent by using the k8s-node-termination-handler event handler.
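
As a rough illustration of the write-your-own option, the following sketch shows one way to detect a preemption notice from inside the VM. It assumes that the script runs on a Compute Engine instance, where the metadata server exposes the instance/preempted value. The helper names wait_for_preemption and on_metadata_value are hypothetical, and this is not a description of how k8s-node-termination-handler is implemented.

```shell
# Assumption: this URL resolves only from inside a Compute Engine VM.
METADATA_URL="http://metadata.google.internal/computeMetadata/v1/instance/preempted"

# Blocks until the preempted value changes, then prints it (TRUE or FALSE).
wait_for_preemption() {
  curl -s -H "Metadata-Flavor: Google" "${METADATA_URL}?wait_for_change=true"
}

# Hypothetical helper: maps a metadata value to the action to take.
on_metadata_value() {
  case "$1" in
    TRUE) echo "evict" ;;
    *)    echo "ignore" ;;
  esac
}
```

On a PVM, a listener could call on_metadata_value "$(wait_for_preemption)" in a loop and start evicting Pods when the result is evict.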

k8s-node-termination-handler uses a Kubernetes daemonset to create an agent on every instance in pvm-pool. The agent watches for a node termination event by using Compute Engine metadata APIs. Whenever a termination event is observed, the agent starts the Pod eviction process. In this example, 20 seconds are allocated as a grace period for regular Pods and 10 seconds for system Pods. This means that if a preStop hook or SIGTERM handler is configured for the Pod, it can run for up to 20 seconds before the Pod is shut down by SIGKILL. The grace period is a configurable parameter in k8s-node-termination-handler. The total grace period for regular and system Pods cannot exceed the 30-second preemption notice.

The agent also taints the node to prevent new Pods from being scheduled.
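
The grace period budget described above is simple arithmetic. The following sketch checks whether a proposed split fits in the preemption notice. The function name fits_preemption_notice is hypothetical.

```shell
# Succeeds (exit status 0) when both grace periods fit in the notice window.
# Arguments: regular-Pod seconds, system-Pod seconds, notice seconds (default 30).
fits_preemption_notice() {
  regular=$1
  system=$2
  notice=${3:-30}
  [ $((regular + system)) -le "$notice" ]
}

if fits_preemption_notice 20 10; then
  echo "20s + 10s fits within the 30s notice"
fi
```

A split such as 25 seconds for regular Pods and 10 seconds for system Pods would fail this check, because it needs 35 seconds.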

When you evict Pods, the preStop hook gets triggered. In this example, preStop is configured to do two tasks:

  • Fail the health check of the application. This is done to signal to the load balancer that it must remove the Pod from the request serving path.
  • Sleep for the duration of the grace period (20 seconds) allotted by k8s-node-termination-handler. The Pod stays alive for 20 seconds to process any in-flight requests.
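
The two preStop tasks can be sketched as a small shell function. This is an illustration only. It assumes that the application's health endpoint reports a failure while a marker file exists, which might differ from how the sample application fails its health check. The function name pre_stop and the marker path are hypothetical.

```shell
# Hypothetical preStop hook. Assumes that the health endpoint reports a
# failure for as long as the marker file exists.
pre_stop() {
  marker=$1   # path of the marker file (assumption)
  grace=$2    # seconds to keep the Pod alive for in-flight requests
  touch "$marker"
  sleep "$grace"
  echo "Exiting PreStop hook"
}
```

With the values used in this tutorial, the hook would be invoked with a grace value of 20, for example pre_stop /tmp/unhealthy 20.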

While the preStop hook is executing, to ensure that Kubelet is not abruptly shut down by the node OS, you create a systemd service that blocks the shutdown. This approach helps ensure that Kubelet can manage the lifecycle of Pods during shutdown without the node OS's interference. You use a Kubernetes daemonset to create the service. This daemonset will run on every instance in pvm-pool.

This example uses Traffic Director to manage traffic. Note that you can use any proxy solution such as OSS Envoy, Istio, Nginx, or HAProxy to manage traffic as long as you adhere to the subsystem guidelines used in this example, where Traffic Director is configured to do the following:

  • Enable weighted routing of the requests. In this example, you create three application replicas on pvm-pool and one on default-pool. Traffic Director is configured to split traffic 75%-25% between pvm-pool and default-pool. In case of preemption, all the requests are automatically failed over to default-pool.

    This tutorial provides a simplistic example of traffic splitting. You can also set match conditions on traffic ports, header fields, URIs, and more to route the request to a specific node pool. For more information, see advanced traffic routing techniques using Traffic Director.

  • Retry requests up to three times if, during a preemption, a request results in a 5xx status code (for example, because a gateway is unavailable or an upstream connection times out).

  • Use a circuit breaker to limit the maximum number of retries that can be outstanding at any given time.

  • Use outlier detection to evict unhealthy endpoints from the load balancer's serving path.
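
To make the weighted split and failover concrete, the following sketch computes how many requests each node pool would receive. It is plain arithmetic under the 75%-25% weights from this example and does not model how Traffic Director distributes individual requests. The function name split_requests is hypothetical.

```shell
# Prints "<pvm requests> <standard requests>" for a batch of requests.
# Arguments: total requests, pvm-pool weight, default-pool weight,
# and whether pvm-pool is healthy ("true" or "false").
split_requests() {
  total=$1
  pvm_weight=$2
  std_weight=$3
  pvm_healthy=$4
  if [ "$pvm_healthy" != "true" ]; then
    # During a preemption, everything fails over to default-pool.
    echo "0 $total"
    return
  fi
  pvm=$((total * pvm_weight / (pvm_weight + std_weight)))
  echo "$pvm $((total - pvm))"
}

split_requests 100 75 25 true    # prints "75 25"
split_requests 100 75 25 false   # prints "0 100"
```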

Traffic Director uses the sidecar proxy model. In this model, the following events occur:

  1. Clients send requests to a Google Cloud–managed load balancer.
  2. The load balancer sends the traffic to an edge proxy configured by Traffic Director.
  3. The edge proxy applies predefined policies (such as request distribution between different endpoints, retries, and circuit breaking), and it load-balances the requests to services in the GKE cluster.
  4. The sidecar proxy intercepts the traffic and forwards it to the application.

For more information, see how traffic interception and forwarding works with Traffic Director.

Objectives

  • Deploy a GKE cluster with two node pools using standard VMs and PVMs.
  • Deploy a sample application.
  • Configure cluster resources to gracefully handle preemption.
  • Configure Traffic Director to control traffic to GKE services.
  • Simulate preemption of a PVM.
  • Verify that the Pods are gracefully shut down.
  • Verify that new requests are not forwarded to Pods that are undergoing preemption.
  • Verify that in-flight and new requests are served without any disruption.

Costs

This tutorial uses billable components of Google Cloud.

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Cleaning up.

Before you begin

  1. In the Google Cloud Console, go to the project selector page.

  2. Select or create a Google Cloud project.

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. In the Cloud Console, activate Cloud Shell.

    At the bottom of the Cloud Console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Cloud SDK already installed, including the gcloud command-line tool, and with values already set for your current project. It can take a few seconds for the session to initialize.

  5. Enable the Compute Engine, GKE, Container Analysis, Cloud Build, Container Registry, and Traffic Director APIs.

  6. Find your project ID and set it in Cloud Shell. Replace YOUR_PROJECT_ID with your project ID.
    gcloud config set project YOUR_PROJECT_ID
    
  7. Export the following environment variables:
    export PROJECT=$(gcloud config get-value project)
    export CLUSTER=$PROJECT-gke
    export REGION="us-central1"
    export ZONE="us-central1-c"
    export TERMINATION_HANDLER="https://github.com/GoogleCloudPlatform/k8s-node-termination-handler"
    

Creating a GKE cluster

  1. In Cloud Shell, create a GKE cluster that has a default node pool, default-pool, with standard instances, and a custom node pool, pvm-pool, with PVMs:

    gcloud beta container clusters create $CLUSTER \
       --zone=$ZONE \
       --num-nodes="1" \
       --enable-ip-alias \
       --machine-type="n1-standard-4" \
       --scopes=https://www.googleapis.com/auth/cloud-platform
    gcloud beta container node-pools create "pvm-pool" \
       --cluster=$CLUSTER \
       --zone=$ZONE \
       --preemptible \
       --machine-type="n1-standard-4" \
       --scopes=https://www.googleapis.com/auth/cloud-platform \
       --num-nodes="1"
    
  2. Clone the solutions-gke-pvm-preemption-handler code repository that you use for this tutorial:

    git clone https://github.com/GoogleCloudPlatform/solutions-gke-pvm-preemption-handler && \
    cd solutions-gke-pvm-preemption-handler
    
  3. Create a daemonset that runs on PVM instances in the GKE cluster and creates a systemd service that blocks the shutdown of the Kubelet process:

    kubectl apply -f daemonset.yaml
    
  4. Get the name of the PVM instance deployed as part of the pvm-pool node pool:

    PVM=$(kubectl get no \
        -o=jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | grep pvm)
    
  5. Verify that the service is deployed correctly:

    1. Create a firewall rule in order to use SSH to connect to the PVM through IAP forwarding:

      gcloud compute firewall-rules create allow-ssh-ingress-from-iap \
          --direction=INGRESS \
          --action=allow \
          --rules=tcp:22 \
          --source-ranges=35.235.240.0/20
      
    2. Use SSH to connect to the PVM:

      gcloud compute ssh $PVM --tunnel-through-iap --zone=$ZONE
      
    3. In the PVM terminal, check the status of the deployed service:

      systemctl status delay.service
      

      You see the state of the service as Active (exited).

      ...
      delay.service - Delay GKE shutdown
         Loaded: loaded (/etc/systemd/system/delay.service; enabled; vendor preset: disabled)
         Active: active (exited) since Tue 2020-07-21 04:48:33 UTC; 1h 17min ago
      ...
      
    4. Exit the PVM terminal by typing exit.

  6. Deploy k8s-node-termination-handler:

    1. Clone the repository:

      git clone $TERMINATION_HANDLER
      
    2. In a text editor, open the k8s-node-termination-handler/deploy/k8s.yaml file and look for the following line:

      args: ["--logtostderr", "--exclude-pods=$(POD_NAME):$(POD_NAMESPACE)", "-v=10", "--taint=cloud.google.com/impending-node-termination::NoSchedule"]
      
    3. Replace the previous line with the following line, which allocates a 10-second grace period to shut down system Pods. The remaining 20 seconds in the grace period are automatically allocated to regular Pods.

      args: ["--logtostderr", "--exclude-pods=$(POD_NAME):$(POD_NAMESPACE)", "-v=10", "--taint=cloud.google.com/impending-node-termination::NoSchedule", "--system-pod-grace-period=10s"]
      
    4. Deploy the handler:

      kubectl apply \
          -f k8s-node-termination-handler/deploy/k8s.yaml \
          -f k8s-node-termination-handler/deploy/rbac.yaml
      
  7. Verify that the node termination handler was deployed correctly:

    kubectl get ds node-termination-handler -n kube-system
    

    The output is similar to the following:

    NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    node-termination-handler   1         1         1       1            1           <none>          101s
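
If you want to script this verification, the following sketch checks that READY matches DESIRED in the command output. It assumes the column layout shown above, with a header row followed by one row per daemonset. The function name daemonset_ready is hypothetical.

```shell
# Reads `kubectl get ds` output on stdin. Succeeds only if there is at least
# one daemonset row and every row reports READY equal to DESIRED.
daemonset_ready() {
  awk 'NR > 1 { rows++; if ($2 != $4) bad = 1 }
       END { if (bad || rows == 0) exit 1 }'
}
```

For example, you might run kubectl get ds node-termination-handler -n kube-system | daemonset_ready && echo "handler ready".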
    

Deploying the application

  1. In Cloud Shell, configure firewall rules for health checks:

    gcloud compute firewall-rules create fw-allow-health-checks \
        --action=ALLOW \
        --direction=INGRESS \
        --source-ranges=35.191.0.0/16,130.211.0.0/22 \
        --rules tcp
    
  2. Deploy the application:

    kubectl apply -f deploy.yaml
    
  3. Verify that there are two different deployments running on default-pool and pvm-pool:

    1. For default-pool, run the following:

      kubectl get po -l app=web-std \
          -o=custom-columns=NAME:.metadata.name,Node:.spec.nodeName
      

      The output is similar to the following:

      NAME                       Node
      web-std-695b5fb6c4-55gcc   gke-vital-octagon-109612-default-pool-dcdb8fe5-2tc7
      
    2. For pvm-pool, run the following:

      kubectl get po -l app=web-pvm \
          -o=custom-columns=NAME:.metadata.name,Node:.spec.nodeName
      

      The output is similar to the following.

      NAME                       Node
      web-pvm-6f867bfc54-nm6fb   gke-vital-octagon-109612-gke-pvm-pool-664ec4ff-2cgc
      

      For each service, a standalone NEG is created containing endpoints that are the Pod's IP addresses and ports. For more information and examples, see Standalone network endpoint groups.

  4. Confirm that the standalone NEG was created:

    gcloud beta compute network-endpoint-groups list
    

    The output is similar to the following.

    NAME                                       LOCATION       ENDPOINT_TYPE   SIZE
    k8s1-be35f81e-default-web-pvm-80-7c99357f  us-central1-c  GCE_VM_IP_PORT  1
    k8s1-be35f81e-default-web-std-80-f16dfcec  us-central1-c  GCE_VM_IP_PORT  1
    

    To manage the application by using Traffic Director, you must expose the deployments as network endpoint groups (NEGs). As discussed in the Architecture section, the deployments are created with a sidecar proxy.

  5. Verify that the deployments are created with a sidecar proxy:

    kubectl get pods -l app=web-std \
        -o jsonpath='{.items[*].spec.containers[*].name}'
    

    The output is similar to the following.

    hello-app istio-proxy
    

    To see similar results for Pods that are running in pvm-pool, you can run the following:

    kubectl get pods -l app=web-pvm \
        -o jsonpath='{.items[*].spec.containers[*].name}'
    

Creating the Traffic Director service

Traffic Director uses a configuration similar to other Cloud Load Balancing products. In this section, you configure a health check, a backend service, a URL map, a target HTTP proxy, and a forwarding rule for Traffic Director:

  1. In Cloud Shell, find the NEGs that you created previously and store their names in a variable:

    1. For the default-pool service, use the following:

      NEG_NAME_STD=$(gcloud beta compute network-endpoint-groups list \
                     | grep web-std | awk '{print $1}')
      
    2. For the pvm-pool service, use the following:

      NEG_NAME_PVM=$(gcloud beta compute network-endpoint-groups list \
                     | grep web-pvm | awk '{print $1}')
      
  2. Create the health check:

    gcloud compute health-checks create http td-gke-health-check \
        --request-path=/health \
        --use-serving-port \
        --healthy-threshold=1 \
        --unhealthy-threshold=2 \
        --check-interval=2s \
        --timeout=2s
    
  3. Replace the placeholders in the manifest files:

    sed -i -e "s/\[PROJECT_ID\]/$PROJECT/g" td-gke-service-config.yaml
    sed -i -e "s/\[ZONE\]/$ZONE/g" td-gke-service-config.yaml
    sed -i -e "s/\[NEG_NAME_STD\]/$NEG_NAME_STD/g" td-gke-service-config.yaml
    sed -i -e "s/\[NEG_NAME_PVM\]/$NEG_NAME_PVM/g" td-gke-service-config.yaml
    sed -i -e "s/\[PROJECT_ID\]/$PROJECT/g" td-urlmap.yaml
    
  4. Create a Traffic Director service:

    gcloud compute backend-services import td-gke-service \
        --source=td-gke-service-config.yaml --global
    

    The service is configured to split traffic between the two NEGs that you created earlier by using the capacity scaler. The service is also configured with outlier detection and circuit breaking:

    • For outlier detection, Consecutive errors (or gateway failures before a host is evicted from the service) is set to 2.
    • For circuit breaking, Max retries is set to 3.
  5. Verify that the Traffic Director service is deployed correctly:

    gcloud compute backend-services get-health td-gke-service --global
    

    The output is similar to the following:

    ---
    backend: default-pool-service-NEG
    status:
      healthStatus:
      - healthState: HEALTHY
    ...
    ---
    backend: pvm-pool-service-NEG
    status:
      healthStatus:
      - healthState: HEALTHY
    ...
    

    You might have to wait for a few minutes, and run the command multiple times, before the backend is shown as HEALTHY.

  6. Create a URL map that uses the service you created:

    gcloud compute url-maps import web-service-urlmap \
        --source=td-urlmap.yaml
    

    The URL map sets up traffic routing. All requests to the path "/*" are routed to the Traffic Director service that you created. The map also sets up a policy to retry requests (a maximum of three times) that result in a 5xx status code.

  7. Create the target HTTP proxy:

    gcloud compute target-http-proxies create td-gke-proxy \
        --url-map=web-service-urlmap
    
  8. Create the forwarding rule that uses the virtual IP (VIP) address 0.0.0.0:

    gcloud compute forwarding-rules create td-gke-forwarding-rule \
        --global \
        --load-balancing-scheme=INTERNAL_SELF_MANAGED \
        --address=0.0.0.0 \
        --target-http-proxy=td-gke-proxy \
        --ports=80
    

    At this point, GKE services in default-pool and pvm-pool are accessible on the service VIP that is load-balanced by Traffic Director.
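
The retry policy in the URL map can be pictured with the following sketch, which retries a request while it returns a 5xx status code, up to a fixed budget. This is a simplified model and not Traffic Director's implementation. The function names request_with_retries and always_unavailable are hypothetical.

```shell
# Runs a command that prints an HTTP status code. Retries while the code is
# 5xx, up to max_retries extra attempts. Prints "<final code> <retries used>".
request_with_retries() {
  max_retries=$1
  shift
  retries=0
  while :; do
    code=$("$@")
    if [ "$code" -lt 500 ] || [ "$retries" -ge "$max_retries" ]; then
      echo "$code $retries"
      return
    fi
    retries=$((retries + 1))
  done
}

# Hypothetical backend that is always unavailable.
always_unavailable() { echo 503; }

request_with_retries 3 always_unavailable   # prints "503 3"
```

With a budget of three retries, a backend that recovers on the third attempt returns a successful response to the client, while a backend that stays down surfaces the 5xx error after four attempts in total.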

Creating the load balancer

In this section, you configure a load balancer and an edge proxy for user traffic. The load balancer acts as a gateway to the setup that you just created.

  1. In Cloud Shell, create a Traffic Director–managed edge proxy:

    kubectl apply -f edge-proxy.yaml
    
  2. Verify that the load balancer is healthy and ready to serve traffic:

    kubectl describe ingress gateway-proxy-ingress
    

    The output is similar to the following:

    ...
      Host        Path  Backends
      ----        ----  --------
      *           *     gateway-proxy-svc:80 (10.20.0.14:80)
    
    Annotations:  ingress.kubernetes.io/backends: {"k8s1-da0dd12b-default-gateway-proxy-svc-80-b3b7b808":"HEALTHY"}
    ...
    

    The backend should be in the HEALTHY state. It might take several minutes, and multiple retries of the command, for the load balancer to be ready to accept traffic.

  3. Record the IP address for use later:

    IPAddress=$(kubectl get ingress gateway-proxy-ingress \
                -o jsonpath="{.status.loadBalancer.ingress[*].ip}")
    

Generating traffic

Now that configuration is complete, it's time to test it.

  1. In Cloud Shell, click Open a new tab + to start a new Cloud Shell session.

  2. In the new shell, set the project ID:

    gcloud config set project YOUR_PROJECT_ID
    
  3. Install Kubetail and use it to observe logs from multiple application Pods together:

    sudo apt-get update
    sudo apt-get install kubetail
    kubetail web
    
  4. In the original shell, simulate traffic:

    seq 1 100 | xargs -I{} -P 10 curl -I http://$IPAddress
    

    This command generates 100 requests, 10 parallel requests at a time.

    In the new shell, you see logs similar to the following:

    ---
    [web-pvm-6f867bfc54-nm6fb hello-app] Received request at: 2020-07-20 20:26:23.393
    [web-pvm-6f867bfc54-nm6fb hello-app] Received request at: 2020-07-20 20:26:23.399
    [web-std-6f867bfc54-55gcc hello-app] Received request at: 2020-07-20 20:26:24.001
    ...
    

    Verify that requests are distributed between node pools based on the 75%-25% split that you configured earlier.

  5. Back in the original shell, generate traffic by using httperf:

    sudo apt-get install httperf && \
    httperf --server=$IPAddress --port=80 --uri=/ \
            --num-conns=5000 --rate=20 --num-calls=1 \
            --print-reply=header
    

    This command installs httperf and generates 5,000 requests against the example application at the rate of 20 requests per second. This test runs for ~250 seconds.
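
To check the 75%-25% split without counting log lines by hand, you could tally requests per deployment. The following sketch assumes log lines in the format shown earlier, where each line starts with the Pod name in square brackets, and assumes that the logs were saved to a file or piped in. The function name count_by_deployment is hypothetical.

```shell
# Reads Kubetail-style log lines on stdin and prints "<deployment> <count>"
# for each deployment that logged a "Received request" line.
count_by_deployment() {
  awk '/Received request/ {
         name = $1
         sub(/^\[/, "", name)       # drop the leading "["
         split(name, parts, "-")    # web-pvm-<hash>-<id> becomes web, pvm, ...
         counts[parts[1] "-" parts[2]]++
       }
       END { for (d in counts) print d, counts[d] }' | sort
}
```

With a large enough sample, the web-pvm count should be roughly three times the web-std count.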

Simulating preemption

  1. In the new shell, exit the Kubetail command by pressing Ctrl+C.
  2. Get the name of the PVM instance deployed as part of pvm-pool:

    PVM=$(kubectl get no \
          -o=jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | grep pvm)
    
  3. While the httperf test is in progress, trigger a maintenance event on the PVM instance to simulate preemption:

    gcloud compute instances simulate-maintenance-event $PVM \
        --zone=us-central1-c
    
  4. Observe the application logs in the new shell:

    kubectl logs -f deploy/web-pvm -c hello-app
    

    When the preStop hook runs, the health check for the application explicitly reports a failure:

    ...
    Health status fail at: 2020-07-21 04:45:43.742
    ...
    

    The Pod continues to receive requests until it is removed from the request serving path because of failed health checks. This removal might take a few seconds to propagate.

    ...
    Received request at: 2020-07-21 04:45:45.735
    Received request at: 2020-07-21 04:45:45.743
    Health status fail at: 2020-07-21 04:45:45.766
    ...
    

    After a few seconds, the Pod stops receiving new requests. The Pod continues to live for 20 seconds. (The preStop handler is made to sleep for 20 seconds to allow cleanup activities, including any in-flight requests.) The Pod is then shut down.

    ...
    Health status fail at: 2020-07-21 04:46:01.796
    2020-07-21 04:46:02.303  INFO 1 --- [       Thread-3] ConfigServletWebServerApplicationContext : Closing org.springframework.boot.web.servlet.context.AnnotationConfigServletWebServerApplicationContext@27ddd392: startup date [Tue Jul 21 04:39:44 UTC 2020]; root of context hierarchy
    Exiting PreStop hook
    ...
    
  5. Optionally, execute the command kubetail web to tail logs from all application Pods that are running across both node pools. The output shows requests being routed to default-pool as the PVM is undergoing preemption:

    ...
    [web-pvm-6f867bfc54-nm6fb] Received request at: 2020-07-20 20:45:45.743
    [web-pvm-6f867bfc54-nm6fb] Health status fail at: 2020-07-21 04:45:45.766
    [web-std-6f867bfc54-55gcc] Received request at: 2020-07-20 04:45:45.780
    [web-std-6f867bfc54-55gcc] Received request at: 2020-07-20 04:45:45.782
    ...
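
The timing in these logs can be estimated from the health check settings used earlier (an unhealthy threshold of 2 and a check interval of 2 seconds) and the 20-second grace period. The following sketch is a rough estimate that ignores the health check timeout and propagation delay. The function names are hypothetical.

```shell
# Rough time, in seconds, for a backend to be marked unhealthy.
detection_seconds() {
  unhealthy_threshold=$1
  check_interval=$2
  echo $((unhealthy_threshold * check_interval))
}

# Seconds left in the grace period after the backend is marked unhealthy.
# Arguments: grace period, unhealthy threshold, check interval.
drain_seconds() {
  grace=$1
  detect=$(detection_seconds "$2" "$3")
  echo $((grace - detect))
}

echo "detection: $(detection_seconds 2 2)s, drain window: $(drain_seconds 20 2 2)s"
```

Under these assumptions, the Pod is removed from the serving path after about 4 seconds, which leaves about 16 seconds for in-flight requests to complete.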
    

Post-preemption validations

  1. In the original shell, wait for the httperf test to complete. On completion, the output is similar to the following.

    ...
    Reply status: 1xx=0 2xx=5000 3xx=0 4xx=0 5xx=0
    ...
    Errors: total 0 client-timo 0 socket-timo 0 connrefused 0 connreset 0
    ...
    

    The output indicates that post-preemption the requests were served by default-pool, and pvm-pool was gracefully shut down.

  2. Verify that the PVM is up again and that the original state of the cluster is restored:

    kubectl get po -l app=web-pvm \
        -o=custom-columns=NAME:.metadata.name,Node:.spec.nodeName
    

    Notice that a replica is again provisioned on pvm-pool:

    NAME                     Node
    web-pvm-6f867bfc54-9z2cp gke-vital-octagon-109612-gke-pvm-pool-664ec4ff-49lx
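
To check the httperf result from step 1 automatically, you could save the output and scan it for server errors. The following sketch assumes the Reply status and Errors lines in the format shown in step 1. The function name httperf_clean is hypothetical.

```shell
# Reads httperf output on stdin. Succeeds only if both summary lines are
# present, the reply status reports 5xx=0, and the total error count is 0.
httperf_clean() {
  awk '/Reply status:/ { seen_status = 1; if ($NF != "5xx=0") bad = 1 }
       /Errors: total/ { seen_errors = 1; if ($3 != 0) bad = 1 }
       END { if (bad || !seen_status || !seen_errors) exit 1 }'
}
```

For example, if you redirected the httperf output to a file, you might run httperf_clean < httperf.log && echo "no disruption observed", where httperf.log is an illustrative file name.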
    

Considerations

Before running this solution in production, consider these qualifications:

  1. This tutorial does not employ Pod or cluster autoscaling policies. For production environments, make sure you have the right autoscaling in place to handle traffic spikes.
  2. Carefully consider the split between standard VMs and PVMs in the cluster. For example, suppose you are running 100 PVMs and only 1 standard VM, and 50% of the PVMs are preempted at once. In this case, the standard VM pool takes some time to scale out to make up for the preempted resources. In the meantime, user traffic is impacted. To mitigate large preemptions, you can use third-party applications, like Spot and estafette-gke-preemptible-killer, to spread out the preemptions to prevent multiple instances from going down together.
  3. Based on your use case, carefully test the grace period that's allocated to regular and system Pods by using k8s-node-termination-handler. Given the criticality of the application, you might have to allocate more than 20 seconds to the regular Pods. A possible downside of this approach is that it might not leave sufficient time for system Pods to cleanly shut down. This can result in possible loss of logs and monitoring metrics that are handled by system Pods.
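
The sizing example in the second consideration can be worked through with a short calculation. The function name nodes_after_preemption is hypothetical, and the sketch counts nodes only. It does not account for differences in node size or load.

```shell
# Prints "<nodes lost> <nodes remaining>" after a mass preemption.
# Arguments: number of PVMs, number of standard VMs, percent of PVMs preempted.
nodes_after_preemption() {
  pvms=$1
  standard=$2
  preempted_percent=$3
  lost=$((pvms * preempted_percent / 100))
  echo "$lost $((pvms - lost + standard))"
}

nodes_after_preemption 100 1 50   # prints "50 51"
```

In this scenario, about half of the cluster's nodes disappear at once, and the single standard VM cannot absorb the displaced traffic until the pool scales out.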

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, you can delete the Cloud project that you created for this tutorial, or delete the resources associated with this tutorial.

Delete the Cloud project

The easiest way to eliminate billing is to delete the project you created for the tutorial.

  1. In the Cloud Console, go to the Manage resources page.

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the resources

If you want to keep the project that you used in this tutorial, delete the individual resources.

  1. In Cloud Shell, delete the GKE cluster:

    gcloud container clusters delete $CLUSTER --zone=$ZONE --async
    
  2. Delete the firewall rules:

    gcloud compute firewall-rules delete allow-ssh-ingress-from-iap && \
    gcloud compute firewall-rules delete fw-allow-health-checks
    
  3. Delete the Traffic Director service:

    gcloud compute forwarding-rules delete td-gke-forwarding-rule \
        --global && \
    gcloud compute target-http-proxies delete td-gke-proxy \
        --global && \
    gcloud compute url-maps delete web-service-urlmap \
        --global && \
    gcloud compute backend-services delete td-gke-service \
        --global && \
    gcloud compute health-checks delete td-gke-health-check \
        --global
    
  4. Delete all NEGs:

    NEG_NAME_EDGE=$(gcloud beta compute network-endpoint-groups list \
                    | grep gateway-proxy | awk '{print $1}') && \
    gcloud beta compute network-endpoint-groups delete $NEG_NAME_EDGE \
        --zone=$ZONE && \
    gcloud beta compute network-endpoint-groups delete $NEG_NAME_STD \
        --zone=$ZONE && \
    gcloud beta compute network-endpoint-groups delete $NEG_NAME_PVM \
        --zone=$ZONE
    
  5. Delete downloaded code, artifacts, and other dependencies:

    cd .. && rm -rf solutions-gke-pvm-preemption-handler
    

What's next