This tutorial shows you how to handle preemptions while running Spot VMs on a Google Kubernetes Engine (GKE) cluster that's serving a web application. Spot VMs are affordable, short-lived compute instances that are suitable for fault-tolerant workloads. Spot VMs are the latest version of preemptible VMs.
This tutorial is intended for application developers, system architects, and DevOps engineers who define, implement, and deploy web-facing applications and want to use Spot VMs in production deployments. The tutorial assumes that you understand fundamental Kubernetes concepts and that you are familiar with the load balancing components of an external HTTP(S) load balancer.
Background
A Spot VM receives a 30-second warning before it is shut down. When the instance is about to be preempted, Compute Engine sends a preemption notice to the instance in the form of an ACPI G2 Soft Off (`SIGTERM`) signal. After 30 seconds, an ACPI G3 Mechanical Off (`SIGKILL`) signal is sent to the instance's operating system, and the VM transitions to a `TERMINATED` state.
Spot VMs are a good choice for distributed, fault-tolerant workloads that don't require continuous availability of a single instance. Examples of this type of workload include video encoding, rendering for visual effects, data analytics, simulation, and genomics. However, because of availability limitations and potentially frequent interruptions resulting from preemptions, Spot VMs are generally not recommended for web-facing and user-facing applications.
This tutorial walks through setting up a deployment that uses a combination of Spot VMs and standard VMs on GKE to help reliably serve web application traffic without any disruption.
Challenges of using Spot VMs
The biggest challenge of using Spot VMs in serving user-facing traffic is to ensure that user requests are not disrupted. On preemption, you must address the following types of questions:
- How do you ensure an application's availability when it's running on Spot VMs? Spot VMs don't have guaranteed availability and are explicitly excluded from Compute Engine service level agreements.
- How do you handle graceful termination of the application such that the following statements are true:
- The load balancer stops forwarding requests to the Pods that are running on an instance that's being preempted.
- In-flight requests are handled gracefully; they either run to completion or are terminated cleanly.
- Connections to your databases and the application are closed or drained before the instance is shut down.
- How do you handle requests, such as business-critical transactions, that might require uptime guarantees or that are not fault tolerant?
Consider the challenges in gracefully shutting down containers that are running on Spot VMs. From an implementation standpoint, the easiest way to write cleanup logic when an instance is shutting down is through a shutdown script. However, shutdown scripts are not supported if you are running containerized workloads in GKE.
Alternatively, you can use a `SIGTERM` handler in your application to write cleanup logic. Containers also provide lifecycle hooks, such as `preStop`, which is triggered just before the container is shut down. In Kubernetes, the kubelet is responsible for executing container lifecycle events. Within a Kubernetes cluster, the kubelet runs on each VM and watches for Pod specs through the Kubernetes API server.
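To make that concrete, here is a minimal sketch of a `SIGTERM` handler written as a shell entrypoint. The cleanup steps are placeholders, not part of this tutorial's sample application; the `kill` at the end only simulates the signal that the kubelet would send to process ID 1 of the container:

```shell
#!/bin/sh
# Illustrative entrypoint: trap SIGTERM and run cleanup logic before exit.
CLEANED_UP=0

cleanup() {
  echo "SIGTERM received: draining connections"
  # ...placeholder: close database connections, finish in-flight work...
  CLEANED_UP=1
  # In a real container, you would exit here once cleanup is complete.
}

trap cleanup TERM

# Simulate the kubelet sending SIGTERM to this process:
kill -TERM $$ &
wait
```

When this pattern runs in a container, the Pod's `terminationGracePeriodSeconds` bounds how long the handler can take before `SIGKILL` arrives.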
When you evict a Pod by using the command line or the API, the kubelet sees that the Pod has been marked as terminating and begins the shutdown process. The duration of the process is limited by the grace period: a set number of seconds after which the kubelet sends a `SIGKILL` signal to the containers. As part of the graceful shutdown, if any container running in the Pod has defined a `preStop` hook, the kubelet runs the hook inside the container. Next, the kubelet sends a `SIGTERM` signal to process ID 1 inside each container. If the application has a `SIGTERM` handler, the handler is executed. When the grace period expires, the kubelet sends a `SIGKILL` signal to any processes still running in the Pod.
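To see how these pieces fit together, here is a minimal, hypothetical Pod spec that participates in this lifecycle; the Pod name, image, and sleep duration are illustrative, not taken from this tutorial's repository:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-demo          # hypothetical name
spec:
  # Total time the kubelet waits before sending SIGKILL.
  terminationGracePeriodSeconds: 30
  containers:
  - name: app
    image: gcr.io/example/hello-world   # illustrative image
    lifecycle:
      preStop:
        exec:
          # Runs before SIGTERM is sent to PID 1 of the container.
          command: ["/bin/sh", "-c", "sleep 20"]
```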
Imagine that a Kubernetes node is undergoing preemption. You need a
mechanism to catch the preemption notice and start the process of evicting the
Pod. Suppose that you are running a program that listens for a preemption notice
and evicts running Pods upon receiving an event. On eviction, the Pod shutdown
sequence described previously gets triggered. However, in this case, the node is
also undergoing a shutdown, which is handled by the node's operating system
(OS). This shutdown can potentially interfere with kubelet's handling of the
container lifecycle, which means that the container can be abruptly shut down
even if it's in the middle of executing a `preStop` hook.
Furthermore, from the standpoint of availability and traffic management, running your web-facing application exclusively on Spot VMs can also pose various challenges. Before using Spot VMs, consider the following questions:
- What happens if most of the Spot VMs are preempted at once? For an application that serves thousands of requests per second, how do requests fail over without disruption?
- What happens if Spot VM capacity is not available? How do you scale out or maintain a steady state of deployment for your application in that case?
Architecture
To solve the challenges of using Spot VMs, you must do all of the following:
- Run a GKE cluster with two node pools—one running Spot VMs and the other running standard VMs. This lets you split traffic and have an active failover to handle new and in-flight requests in the event of a preemption. This approach also lets you split traffic between standard VMs and Spot VMs based on your requirements of uptime guarantees and fault tolerance. For more information about deciding the size of node pools, see what to consider.
- Listen for preemption notices on Spot VMs and evict Pods that are running on the node.
- Use a `preStop` hook or `SIGTERM` handler to execute cleanup logic.
- Ensure that the kubelet is allowed to handle the Pod termination lifecycle and is not abruptly shut down.
- Taint the node so that no new Pods are scheduled on it while it is being preempted.
The following diagram shows a high-level view of the architecture that you deploy in this tutorial.
As the diagram shows, you create two node pools: `default-pool`, which runs on standard VMs, and `pvm-pool`, which runs on Spot VMs. The default GKE scheduler tries to distribute Pods evenly across all the instances in the node pools. For example, if you deploy four replicas and have two nodes running in each node pool, the scheduler provisions one Pod on each of the four nodes. However, when using Spot VMs, you might want to direct more traffic towards `pvm-pool` to get higher Spot VM utilization and therefore greater cost savings. For example, you might want to provision three Pods in `pvm-pool` and one Pod in `default-pool` for failover, thereby reducing the number of standard VMs in the cluster.
To achieve this kind of control over scheduling, you can either write your own scheduler or split the application into two deployments, each pinned to a node pool by node affinity rules. In this example, you create two deployments, `web-std` and `web-pvm`, where `web-std` is pinned to `default-pool` and `web-pvm` is pinned to `pvm-pool` based on the affinity rule.
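As a sketch, pinning `web-pvm` to Spot VM nodes can use a node affinity rule like the following. GKE adds the `cloud.google.com/gke-spot: "true"` label to Spot VM nodes; the surrounding Deployment fields are omitted here:

```yaml
# Fragment of the web-pvm Deployment's Pod template (illustrative).
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cloud.google.com/gke-spot
          operator: In
          values:
          - "true"
```

For `web-std`, an analogous rule with `operator: DoesNotExist` keeps those Pods on `default-pool`.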
For `pvm-pool` nodes prior to GKE version 1.20, you must listen for a preemption notice and, upon receiving it, start evicting Pods. You can write your own logic to listen for the preemption notice and evict Pods that are running on the node. Or, as in the optional step of this tutorial, you can create an agent by using the `k8s-node-termination-handler` event handler.
`k8s-node-termination-handler` uses a Kubernetes DaemonSet to create an agent on every instance in `pvm-pool`. The agent watches for a node termination event by using Compute Engine metadata APIs. Whenever a termination event is observed, the agent starts the Pod eviction process. In this example, 20 seconds are allocated as a grace period for regular Pods and 10 seconds for system Pods. This means that if a `preStop` hook or `SIGTERM` handler is configured for the Pod, it can run for up to 20 seconds before the Pod is shut down by `SIGKILL`. The grace period is a configurable parameter in `k8s-node-termination-handler`. The total grace period for regular and system Pods cannot exceed the length of the preemption notice, which is 30 seconds.
The agent also taints the node to prevent new Pods from being scheduled.
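If you write your own listener instead of deploying the agent, the core of it is polling the Compute Engine metadata server's `preempted` attribute, which returns `TRUE` when the instance is being preempted. The sketch below wraps that check in a function; `METADATA_CMD` is made overridable only so that the logic can be exercised off-GCP:

```shell
# check_preempted succeeds when the metadata server reports preemption.
METADATA_URL="http://metadata.google.internal/computeMetadata/v1/instance/preempted"
METADATA_CMD=${METADATA_CMD:-"curl -s -H Metadata-Flavor:Google $METADATA_URL"}

check_preempted() {
  [ "$($METADATA_CMD)" = "TRUE" ]
}

# A naive watch loop; on preemption you would taint the node and evict Pods:
# while ! check_preempted; do sleep 1; done
```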
When you evict Pods, the `preStop` hook gets triggered. In this example, the `preStop` hook is configured to do two tasks:

- Fail the application's health check. This signals to the load balancer that it must remove the Pod from the request-serving path.
- Sleep for the duration of the grace period (20 seconds) allotted by `k8s-node-termination-handler`. The Pod stays alive for 20 seconds to process any in-flight requests.
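A sketch of such a `preStop` hook follows. It assumes, for illustration, that the application's health endpoint starts failing when a marker file exists; the tutorial repository's actual implementation may differ:

```yaml
lifecycle:
  preStop:
    exec:
      # 1. Create a marker that makes the health check fail, so the load
      #    balancer removes this Pod from the request-serving path.
      # 2. Sleep for the 20-second grace period so in-flight requests finish.
      command: ["/bin/sh", "-c", "touch /tmp/unhealthy && sleep 20"]
```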
To ensure that the kubelet is not abruptly shut down by the node OS while the `preStop` hook is executing, you create a systemd service that blocks the shutdown. This approach helps ensure that the kubelet can manage the lifecycle of Pods during shutdown without interference from the node OS. You use a Kubernetes DaemonSet to create the service; the DaemonSet runs on every instance in `pvm-pool`.
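One way such a shutdown-blocking unit might look is sketched below; the actual unit shipped in the tutorial repository may differ. A `Type=oneshot` unit with `RemainAfterExit=true` stays `active (exited)`, and because systemd stops units in the reverse of their start order, its `ExecStop` delay runs before the kubelet is stopped:

```ini
# delay.service (illustrative sketch)
[Unit]
Description=Delay GKE shutdown
# Started after the kubelet, therefore stopped before it at shutdown.
After=kubelet.service

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/bin/true
# At shutdown, this holds up the stop sequence, giving the kubelet time
# to finish Pod termination.
ExecStop=/bin/sleep 30

[Install]
WantedBy=multi-user.target
```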
This example uses Traffic Director to manage traffic. You can use any proxy solution, such as open source Envoy, Istio, NGINX, or HAProxy, as long as you follow the approach used in this example, where Traffic Director is configured to do the following:
- Enable weighted routing of requests. In this example, you create three application replicas on `pvm-pool` and one on `default-pool`. Traffic Director is configured to split traffic 75%-25% between `pvm-pool` and `default-pool`. In case of preemption, all requests automatically fail over to `default-pool`. This tutorial provides a simple example of traffic splitting; you can also set match conditions on traffic ports, header fields, URIs, and more to route requests to a specific node pool. For more information, see advanced traffic routing techniques using Traffic Director.
- Retry requests that fail during a preemption. If a request results in a `5xx` status code (for example, because a gateway is not available or an upstream connection times out), Traffic Director retries the request up to three times.
- Use a circuit breaker to limit the maximum number of retries that can be outstanding at any given time.
- Use outlier detection to evict unhealthy endpoints from the load balancer's serving path.
Traffic Director uses the sidecar proxy model. In this model, the following events occur:
- Clients send requests to a Google Cloud–managed load balancer.
- The load balancer sends the traffic to an edge proxy configured by Traffic Director.
- The edge proxy applies predefined policies (such as request distribution between different endpoints, retries, and circuit breaking) and load-balances the requests to services in the GKE cluster.
- The sidecar proxy intercepts the traffic and forwards it to the application.
For more information, see how traffic interception and forwarding works with Traffic Director.
Objectives
- Deploy a GKE cluster with two node pools using standard VMs and Spot VMs.
- Deploy a sample application.
- Configure cluster resources to gracefully handle preemption.
- Configure Traffic Director to control traffic to GKE services.
- Simulate preemption of a Spot VM.
- Verify that the Pods are gracefully shut down.
- Verify that new requests are not forwarded to Pods that are undergoing preemption.
- Verify that in-flight and new requests are served without any disruption.
Costs
In this document, you use the following billable components of Google Cloud:
To generate a cost estimate based on your projected usage,
use the pricing calculator.
When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Clean up.
Before you begin
- In the Google Cloud console, go to the project selector page.
- Select or create a Google Cloud project.
- In the Google Cloud console, activate Cloud Shell.
  At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
- Enable the Compute Engine, GKE, Container Analysis, Cloud Build, Container Registry, and Traffic Director APIs.
- Find your project ID and set it in Cloud Shell. Replace YOUR_PROJECT_ID with your project ID:
  gcloud config set project YOUR_PROJECT_ID
- Export the following environment variables:
export PROJECT=$(gcloud config get-value project)
export CLUSTER=$PROJECT-gke
export REGION="us-central1"
export ZONE="us-central1-c"
Make sure that billing is enabled for your Google Cloud project. Learn how to check if billing is enabled on a project.
Create a GKE cluster
In Cloud Shell, create a GKE cluster that has a default node pool, `default-pool`, with standard instances, and a custom node pool, `pvm-pool`, with Spot VMs:

gcloud container clusters create $CLUSTER \
    --zone=$ZONE \
    --num-nodes="1" \
    --enable-ip-alias \
    --machine-type="n1-standard-4" \
    --scopes=https://www.googleapis.com/auth/cloud-platform

gcloud container node-pools create "pvm-pool" \
    --cluster=$CLUSTER \
    --zone=$ZONE \
    --spot \
    --machine-type="n1-standard-4" \
    --scopes=https://www.googleapis.com/auth/cloud-platform \
    --num-nodes="1"
Optional: Implement graceful shutdown for GKE versions 1.20 and earlier
Clone the `solutions-gke-pvm-preemption-handler` code repository that you use for this tutorial:

git clone https://github.com/GoogleCloudPlatform/solutions-gke-pvm-preemption-handler && \
cd solutions-gke-pvm-preemption-handler
Create a daemonset that runs on Spot VM instances in the GKE cluster, and create a systemd service that blocks the shutdown of the kubelet process:
kubectl apply -f daemonset.yaml
Get the name of the Spot VM instance deployed as part of the `pvm-pool` node pool:

PVM=$(kubectl get no \
    -o=jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | grep pvm)
Verify that the service is deployed correctly:
Create a firewall rule in order to use SSH to connect to the Spot VM through IAP forwarding:
gcloud compute firewall-rules create allow-ssh-ingress-from-iap \
    --direction=INGRESS \
    --action=allow \
    --rules=tcp:22 \
    --source-ranges=35.235.240.0/20
Use SSH to connect to the Spot VM:
gcloud compute ssh $PVM --tunnel-through-iap --zone=$ZONE
In the Spot VM terminal, check the status of the deployed service:
systemctl status delay.service
You see the state of the service as `Active (exited)`:

...
delay.service - Delay GKE shutdown
   Loaded: loaded (/etc/systemd/system/delay.service; enabled; vendor preset: disabled)
   Active: active (exited) since Tue 2020-07-21 04:48:33 UTC; 1h 17min ago
...
Exit the Spot VM terminal by typing `exit`.
Deploy `k8s-node-termination-handler`:

Clone the repository:

export TERMINATION_HANDLER="https://github.com/GoogleCloudPlatform/k8s-node-termination-handler"
git clone $TERMINATION_HANDLER
In a text editor, open the `k8s-node-termination-handler/deploy/k8s.yaml` file and look for the following line:

args: ["--logtostderr", "--exclude-pods=$(POD_NAME):$(POD_NAMESPACE)", "-v=10", "--taint=cloud.google.com/impending-node-termination::NoSchedule"]
Replace the previous line with the following line, which allocates a 10-second grace period for shutting down system Pods. The remaining 20 seconds of the grace period are automatically allocated to regular Pods.
args: ["--logtostderr", "--exclude-pods=$(POD_NAME):$(POD_NAMESPACE)", "-v=10", "--taint=cloud.google.com/impending-node-termination::NoSchedule", "--system-pod-grace-period=10s"]
Locate the following line of code:
- key: cloud.google.com/gke-preemptible
Replace `gke-preemptible` with `gke-spot`:

- key: cloud.google.com/gke-spot
Deploy the handler:
kubectl apply \
    -f k8s-node-termination-handler/deploy/k8s.yaml \
    -f k8s-node-termination-handler/deploy/rbac.yaml
Verify that the node termination handler was deployed correctly:
kubectl get ds node-termination-handler -n kube-system
The output is similar to the following:
NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
node-termination-handler   1         1         1       1            1           <none>          101s
Deploy the application
In Cloud Shell, configure firewall rules for health checks:
gcloud compute firewall-rules create fw-allow-health-checks \
    --action=ALLOW \
    --direction=INGRESS \
    --source-ranges=35.191.0.0/16,130.211.0.0/22 \
    --rules tcp
Go to the folder that contains the Dockerfile, then build the Docker image and upload it:

cd ~/solutions-gke-pvm-preemption-handler/app
gcloud builds submit --tag gcr.io/$PROJECT/hello-world
Modify the reference to the newly created Docker image:

cd ~/solutions-gke-pvm-preemption-handler
sed -i -e "s/\[PROJECT_ID\]/$PROJECT/g" deploy.yaml
Deploy the application:
kubectl apply -f deploy.yaml
Verify that there are two different deployments running on `default-pool` and `pvm-pool`:

For `default-pool`, run the following:

kubectl get po -l app=web-std \
    -o=custom-columns=NAME:.metadata.name,Node:.spec.nodeName
The output is similar to the following:
NAME                       Node
web-std-695b5fb6c4-55gcc   gke-vital-octagon-109612-default-pool-dcdb8fe5-2tc7
For `pvm-pool`, run the following:

kubectl get po -l app=web-pvm \
    -o=custom-columns=NAME:.metadata.name,Node:.spec.nodeName
The output is similar to the following:

NAME                       Node
web-pvm-6f867bfc54-nm6fb   gke-vital-octagon-109612-gke-pvm-pool-664ec4ff-2cgc
For each service, a standalone NEG is created containing endpoints that are the Pod's IP addresses and ports. For more information and examples, see Standalone network endpoint groups.
Confirm that the standalone NEG was created:
gcloud compute network-endpoint-groups list
The output is similar to the following:

NAME                                        LOCATION        ENDPOINT_TYPE    SIZE
k8s1-be35f81e-default-web-pvm-80-7c99357f   us-central1-c   GCE_VM_IP_PORT   1
k8s1-be35f81e-default-web-std-80-f16dfcec   us-central1-c   GCE_VM_IP_PORT   1
To manage the application by using Traffic Director, you must deploy the deployments as network endpoint groups (NEGs). As discussed in the Architecture section, each deployment is created with a sidecar proxy.
Verify that the deployments are created with a sidecar proxy:
kubectl get pods -l app=web-std \ -o jsonpath={.items[*].spec.containers[*].name}
The output is similar to the following:
hello-app istio-proxy
To see similar results for Pods that are running in `pvm-pool`, run the following:

kubectl get pods -l app=web-pvm \
    -o jsonpath={.items[*].spec.containers[*].name}
Create the Traffic Director service
Traffic Director uses a configuration similar to other Cloud Load Balancing products—in other words, you must configure the following components for Traffic Director:
- A health check. For more information on health checks, see Health check concepts.
- A backend service. For more information on backend services, see Backend services.
- A route rule. Create a forwarding rule and a URL map. For more information, see Using forwarding rules and Using URL maps.
In Cloud Shell, find the NEGs that you created previously and store their names in variables:

For the `default-pool` service, use the following:

NEG_NAME_STD=$(gcloud compute network-endpoint-groups list \
    | grep web-std | awk '{print $1}')

For the `pvm-pool` service, use the following:

NEG_NAME_PVM=$(gcloud compute network-endpoint-groups list \
    | grep web-pvm | awk '{print $1}')
Create the health check:
gcloud compute health-checks create http td-gke-health-check \
    --request-path=/health \
    --use-serving-port \
    --healthy-threshold=1 \
    --unhealthy-threshold=2 \
    --check-interval=2s \
    --timeout=2s
Replace the placeholders in the manifest files:

sed -i -e "s/\[PROJECT_ID\]/$PROJECT/g" td-gke-service-config.yaml
sed -i -e "s/\[ZONE\]/$ZONE/g" td-gke-service-config.yaml
sed -i -e "s/\[NEG_NAME_STD\]/$NEG_NAME_STD/g" td-gke-service-config.yaml
sed -i -e "s/\[NEG_NAME_PVM\]/$NEG_NAME_PVM/g" td-gke-service-config.yaml
sed -i -e "s/\[PROJECT_ID\]/$PROJECT/g" td-urlmap.yaml
Create a Traffic Director service:
gcloud compute backend-services import td-gke-service \
    --source=td-gke-service-config.yaml --global
The service is configured to split traffic between the two NEGs that you created earlier by using the capacity scaler. The service is also configured with outlier detection and circuit breaking:

- For outlier detection, Consecutive errors (the number of gateway failures before a host is evicted from the serving path) is set to `2`.
- For circuit breaking, Max retries is set to `3`.
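For reference, the relevant parts of `td-gke-service-config.yaml` might look like the following sketch. The field names come from the Compute Engine backend service API; the NEG paths are placeholders, and the file in the tutorial repository is authoritative:

```yaml
name: td-gke-service
loadBalancingScheme: INTERNAL_SELF_MANAGED
healthChecks:
- global/healthChecks/td-gke-health-check
backends:
# With three endpoints in the pvm-pool NEG and one in default-pool,
# RATE balancing yields roughly the 75%-25% split.
- group: zones/ZONE/networkEndpointGroups/NEG_NAME_PVM
  balancingMode: RATE
  maxRatePerEndpoint: 100
  capacityScaler: 1.0
- group: zones/ZONE/networkEndpointGroups/NEG_NAME_STD
  balancingMode: RATE
  maxRatePerEndpoint: 100
  capacityScaler: 1.0
outlierDetection:
  consecutiveErrors: 2
circuitBreakers:
  maxRetries: 3
```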
Verify that the Traffic Director service is deployed correctly:
gcloud compute backend-services get-health td-gke-service --global
The output is similar to the following:
---
backend: default-pool-service-NEG
status:
  healthStatus:
  - healthState: HEALTHY
...
---
backend: pvm-pool-service-NEG
status:
  healthStatus:
  - healthState: HEALTHY
...
You might have to wait a few minutes, and run the command multiple times, before the backends are shown as `HEALTHY`.

Create a URL map that uses the service you created:
gcloud compute url-maps import web-service-urlmap \
    --source=td-urlmap.yaml
The URL map sets up traffic routing: all requests to the path `/*` are routed to the Traffic Director service that you created. The map also sets up a policy to retry requests (up to 3 times) that result in a `5xx` status code.

Create the target HTTP proxy:
gcloud compute target-http-proxies create td-gke-proxy \ --url-map=web-service-urlmap
Create the forwarding rule that uses the virtual IP (VIP) address 0.0.0.0:
gcloud compute forwarding-rules create td-gke-forwarding-rule \
    --global \
    --load-balancing-scheme=INTERNAL_SELF_MANAGED \
    --address=0.0.0.0 \
    --target-http-proxy=td-gke-proxy \
    --ports=80
At this point, GKE services in `default-pool` and `pvm-pool` are accessible through the service VIP that is load-balanced by Traffic Director.
Create the load balancer
In this section, you configure a load balancer and an edge proxy for user traffic. The load balancer acts as a gateway to the setup that you just created.
In Cloud Shell, create a Traffic Director-managed edge proxy:
kubectl apply -f edge-proxy.yaml
Verify that the load balancer is healthy and ready to serve traffic:
kubectl describe ingress gateway-proxy-ingress
The output is similar to the following:
...
Host   Path   Backends
----   ----   --------
*      *      gateway-proxy-svc:80 (10.20.0.14:80)
Annotations:
  ingress.kubernetes.io/backends: {"k8s1-da0dd12b-default-gateway-proxy-svc-80-b3b7b808":"HEALTHY"}
...
The backend should be in the `HEALTHY` state. It might take several minutes, and multiple retries of the command, for the load balancer to be ready to accept traffic.

Record the IP address for use later:

IPAddress=$(kubectl get ingress gateway-proxy-ingress \
    -o jsonpath="{.status.loadBalancer.ingress[*].ip}")
Generate traffic
Now that configuration is complete, it's time to test it.
In Cloud Shell, click Open a new tab to start a new Cloud Shell session.
In the new shell, set the project ID:
gcloud config set project YOUR_PROJECT_ID
Install Kubetail to observe multiple application Pod logs together:
sudo apt-get update
sudo apt-get install kubetail
kubetail web
In the original shell, simulate traffic:
seq 1 100 | xargs -I{} -n 1 -P 10 curl -I http://$IPAddress
This command generates 100 requests, 10 parallel requests at a time.
In the new shell, you see logs similar to the following:
---
[web-pvm-6f867bfc54-nm6fb hello-app] Received request at: 2020-07-20 20:26:23.393
[web-pvm-6f867bfc54-nm6fb hello-app] Received request at: 2020-07-20 20:26:23.399
[web-std-6f867bfc54-55gcc hello-app] Received request at: 2020-07-20 20:26:24.001
...
Verify that requests are distributed between node pools based on the 75%-25% split that you configured earlier.
Back in the original shell, generate traffic by using `httperf`:

sudo apt-get install httperf && \
httperf --server=$IPAddress --port=80 --uri=/ \
    --num-conns=5000 --rate=20 --num-calls=1 \
    --print-reply=header

This command installs `httperf` and generates 5,000 requests against the example application at a rate of 20 requests per second. The test runs for approximately 250 seconds.
Simulate preemption
- In the new shell, exit the Kubetail command by pressing Ctrl+C.
Get the name of the Spot VM instance deployed as part of `pvm-pool`:

PVM=$(kubectl get no \
    -o=jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | grep pvm)

While the `httperf` test is in progress, trigger a maintenance event on the Spot VM instance to simulate preemption:

gcloud compute instances simulate-maintenance-event $PVM \
    --zone=$ZONE
Observe the application logs in the new shell:
kubectl logs -f deploy/web-pvm -c hello-app
When the Pod receives a `SIGTERM` signal, the health check status for the application explicitly reports `fail`:

...
Health status fail at: 2020-07-21 04:45:43.742
...
The Pod continues to receive requests until it is removed from the request serving path because of failed health checks. This removal might take a few seconds to propagate.
...
Received request at: 2020-07-21 04:45:45.735
Received request at: 2020-07-21 04:45:45.743
Health status fail at: 2020-07-21 04:45:45.766
...
After a few seconds, the Pod stops receiving new requests. The Pod continues to live for 20 seconds. (The `preStop` handler is made to sleep for 20 seconds to allow cleanup activities, including the completion of any in-flight requests.) The Pod is then shut down.

...
Health status fail at: 2020-07-21 04:46:01.796
2020-07-21 04:46:02.303 INFO 1 --- [Thread-3] ConfigServletWebServerApplicationContext : Closing org.springframework.boot.web.servlet.context.AnnotationConfigServletWebServerApplicationContext@27ddd392: startup date [Tue Jul 21 04:39:44 UTC 2020]; root of context hierarchy
Exiting PreStop hook
...
Optionally, run `kubetail web` to tail logs from all application Pods that are running across both node pools. The output shows requests being routed to `default-pool` as the Spot VM undergoes preemption:

...
[web-pvm-6f867bfc54-nm6fb] Received request at: 2020-07-20 20:45:45.743
[web-pvm-6f867bfc54-nm6fb] Health status fail at: 2020-07-21 04:45:45.766
[web-std-6f867bfc54-55gcc] Received request at: 2020-07-20 04:45:45.780
[web-std-6f867bfc54-55gcc] Received request at: 2020-07-20 04:45:45.782
...
Post-preemption validations
In the original shell, wait for the `httperf` test to complete. On completion, the output is similar to the following:

...
Reply status: 1xx=0 2xx=5000 3xx=0 4xx=0 5xx=0
...
Errors: total 0 client-timo 0 socket-timo 0 connrefused 0 connreset 0
...
The output indicates that, after the preemption, requests were served by `default-pool` while `pvm-pool` was gracefully shut down.

Verify that the Spot VM is up again and that the original state of the cluster is restored:
kubectl get po -l app=web-pvm \
    -o=custom-columns=NAME:.metadata.name,Node:.spec.nodeName
Notice that the replica is provisioned again on `pvm-pool`:

NAME                       Node
web-pvm-6f867bfc54-9z2cp   gke-vital-octagon-109612-gke-pvm-pool-664ec4ff-49lx
Considerations
Before running this solution in production, consider these qualifications:
- This tutorial does not employ Pod or cluster autoscaling policies. For production environments, make sure you have the right autoscaling in place to handle traffic spikes.
- Carefully consider the split between standard VMs and Spot VMs in the cluster. For example, suppose that you are running 100 Spot VMs and only 1 standard VM, and that 50% of the Spot VMs are preempted at once. In this case, the standard VM pool takes some time to scale out to make up for the preempted resources, and user traffic is impacted in the meantime. To mitigate large preemptions, you can use third-party applications, like Spot and estafette-gke-preemptible-killer, to spread out preemptions and prevent multiple instances from going down together.
- Based on your use case, carefully test the grace period that's allocated to regular and system Pods by using `k8s-node-termination-handler`. Depending on the criticality of the application, you might have to allocate more than 20 seconds to the regular Pods. A possible downside of this approach is that it might not leave sufficient time for system Pods to shut down cleanly, which can result in the loss of logs and monitoring metrics that are handled by system Pods.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, you can delete the Google Cloud project that you created for this tutorial, or delete the resources associated with this tutorial.
Delete the Google Cloud project
The easiest way to eliminate billing is to delete the project you created for the tutorial.
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete the resources
If you want to keep the project that you used in this tutorial, delete the individual resources.
In Cloud Shell, delete the GKE cluster:
gcloud container clusters delete $CLUSTER --zone=$ZONE --async
Delete the firewall rules:
gcloud compute firewall-rules delete allow-ssh-ingress-from-iap && \
gcloud compute firewall-rules delete fw-allow-health-checks
Delete the Traffic Director service:
gcloud compute forwarding-rules delete td-gke-forwarding-rule \
    --global && \
gcloud compute target-http-proxies delete td-gke-proxy \
    --global && \
gcloud compute url-maps delete web-service-urlmap \
    --global && \
gcloud compute backend-services delete td-gke-service \
    --global && \
gcloud compute health-checks delete td-gke-health-check \
    --global
Delete all NEGs:
NEG_NAME_EDGE=$(gcloud compute network-endpoint-groups list \
    | grep gateway-proxy | awk '{print $1}') && \
gcloud compute network-endpoint-groups delete $NEG_NAME_EDGE \
    --zone=$ZONE && \
gcloud compute network-endpoint-groups delete $NEG_NAME_STD \
    --zone=$ZONE && \
gcloud compute network-endpoint-groups delete $NEG_NAME_PVM \
    --zone=$ZONE
Delete downloaded code, artifacts, and other dependencies:
cd .. && rm -rf solutions-gke-pvm-preemption-handler
What's next
- Learn more about best practices for running cost-optimized Kubernetes applications on GKE.
- Learn more about optimizing resources in multi-tenant GKE clusters with auto-provisioning.
- Learn more about cost optimizations on Google Cloud for developers and operators.