This document describes common mitigations and responses to potential security incidents on your Google Kubernetes Engine (GKE) clusters and containers.
The suggestions in Hardening your cluster's security can improve the security of your GKE workloads. Security incidents, however, can occur even when measures to protect your workloads are in place.
Detecting incidents
To detect potential incidents, we recommend you set up a process that collects and monitors your workload's logs. Then, set up alerts based on abnormal events detected from logs. Alerts notify your security team when something unusual is detected. Your security team can then review the potential incident.
Generating alerts from logs
You can customize alerts based on specific metrics or actions. For example, you could alert on high CPU usage on your GKE nodes, which might indicate that they have been compromised for cryptomining.
Alerts should be generated where you aggregate your logs and metrics. For example, you can use GKE Audit Logging in combination with logs-based alerting in Cloud Logging.
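For example, you might create a log-based metric that counts exec calls into Pods and then alert on it in Cloud Monitoring. The following is a minimal sketch; the metric name and log filter are illustrative, so verify the field values against the audit log entries in your own project:
gcloud logging metrics create pod-exec-count \
    --description="Counts exec calls into Pods, from GKE audit logs (illustrative)" \
    --log-filter='resource.type="k8s_cluster" AND protoPayload.methodName="io.k8s.core.v1.pods.exec.create"'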
To learn more about security-relevant queries, see the Audit Logging documentation.
Responding to a security incident
After you have been alerted to an incident, take action. Fix the vulnerability if you can. If you do not know the root cause of the vulnerability or do not have a fix ready, apply mitigations.
The mitigations you might take depend on the severity of the incident and your certainty that you have identified the issue.
This guide covers actions you can take after you detect an incident on a workload running on GKE. You could, in increasing order of severity:
- Snapshot the host VM's disk. A snapshot lets you perform some forensics on the VM state at the time of the anomaly after the workload has been redeployed or deleted.
- Inspect the VM while the workload continues to run. Connecting to the host VM or workload container can provide information about the attacker's actions. We recommend you reduce access before inspecting the live VM.
- Redeploy a container. Redeploying ends currently running processes in the affected container and restarts them.
- Delete a workload. Deleting the workload ends currently running processes in the affected container without a restart.
These mitigations are described in the following sections.
Before you begin
The methods in this topic use the following information:
- The name of the Pods you think have been compromised, or POD_NAME.
- The name of the host VM running the container or Pods, or NODE_NAME.
Also, before taking any of these actions, consider how the attacker might react if they are discovered. The attacker might decide to delete data or destroy workloads. If the risk is too high, consider more drastic mitigations, such as deleting the workload, before performing further investigation.
Snapshot the VM's disk
Creating a snapshot of the VM's disk allows forensic investigation after the workload has been redeployed or deleted. Snapshots can be created while disks are attached to running instances.
To snapshot your persistent disk, first find the disks attached to your VM. Run the following command and look at the source field:
gcloud compute instances describe NODE_NAME --zone COMPUTE_ZONE \
    --format="flattened([disks])"
Look for the lines that contain disks[NUMBER].source. The output is similar to the following:
disks[0].source: https://www.googleapis.com/compute/v1/projects/PROJECT_NAME/zones/COMPUTE_ZONE/disks/DISK_NAME
The disk name is the portion of the source value after the final slash. For example, the disk name is gke-cluster-pool-1-abcdeffff-zgt8.
To complete the snapshot, run the following command:
gcloud compute disks snapshot DISK_NAME
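For example, using the disk name found above, you can also pass the disk's zone and a descriptive snapshot name. The snapshot name here is illustrative:
gcloud compute disks snapshot gke-cluster-pool-1-abcdeffff-zgt8 \
    --zone COMPUTE_ZONE \
    --snapshot-names incident-forensics-snapshot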
For more information, see Creating persistent disk snapshots in the Compute Engine documentation.
Inspect the VM while the workload continues to run
Before taking action, also consider what access the attacker may have. If you suspect a container has been compromised but don't want to tip off the attacker by taking disruptive action, you can connect to the container and inspect it. Inspecting is useful for a quick investigation before taking more disruptive actions. Inspecting is also the least disruptive approach to the workload, but it doesn't stop the incident.
Alternatively, to avoid logging into a machine with a privileged credential, you can analyze your workloads by setting up live forensics (such as GRR Rapid Response), on-node agents, or network filtering.
Reduce access before inspecting the live VM
By cordoning, draining, and limiting network access to the VM hosting a compromised container, you can partially isolate the compromised container from the rest of your cluster. Limiting access to the VM reduces risk but does not prevent an attacker from moving laterally in your environment if they take advantage of a critical vulnerability.
Cordon the node and drain the other workloads from it
Cordoning and draining a node moves workloads colocated with the compromised container to other VMs in your cluster. Cordoning and draining reduces an attacker's ability to impact other workloads on the same node. It does not necessarily prevent them from inspecting a workload's persistent state (for example, by inspecting container image contents).
Use kubectl to cordon the node and ensure that no other Pods are scheduled on it:
kubectl cordon NODE_NAME
After cordoning the node, drain the node of other Pods.
Label the Pod that you are quarantining:
kubectl label pods POD_NAME quarantine=true
Replace POD_NAME with the name of the Pod that you want to quarantine.
Drain the node of Pods that are not labeled with quarantine:
kubectl drain NODE_NAME --pod-selector='!quarantine'
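As an optional sanity check, you can confirm that only the quarantined Pod remains on the node by listing the Pods scheduled on it:
kubectl get pods --all-namespaces -o wide \
    --field-selector spec.nodeName=NODE_NAME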
Restrict network access to the node
We recommend blocking both internal and external traffic from reaching the host VM. Then, allow inbound connections from a specific VM on your network or VPC so that you can connect to the quarantined VM.
The first step is to abandon the VM from the Managed Instance Group that owns it. Abandoning the VM prevents the node from being marked unhealthy and auto-repaired (re-created) before your investigation is complete.
To abandon the VM, run the following command:
gcloud compute instance-groups managed abandon-instances INSTANCE_GROUP_NAME \
--instances=NODE_NAME
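If you don't know the INSTANCE_GROUP_NAME to use in the preceding command, one way to find it is to inspect the node VM's metadata; on nodes created by a managed instance group, a created-by entry typically names the instance group manager. Treat this as a hint to verify, not a guarantee:
gcloud compute instances describe NODE_NAME \
    --zone COMPUTE_ZONE \
    --format="flattened(metadata.items)"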
Firewall the VM
Creating a firewall between the affected container and other workloads in the same network helps prevent an attacker from moving into other parts of your environment while you conduct further analysis. Since you already drained the VM of other containers, this only affects the quarantined container.
The following instructions for firewalling the VM prevent:
- New outbound connections to other VMs in your cluster using an egress rule.
- Inbound connections to the compromised VM using an ingress rule.
To firewall the VM off from your other instances, follow these steps for the node that hosts the Pod you want to quarantine:
Tag the instance so you can apply a new firewall rule.
gcloud compute instances add-tags NODE_NAME \
    --zone COMPUTE_ZONE \
    --tags quarantine
Create a firewall rule to deny all egress TCP traffic from instances with the quarantine tag:
gcloud compute firewall-rules create quarantine-egress-deny \
    --network NETWORK_NAME \
    --action deny \
    --direction egress \
    --rules tcp \
    --destination-ranges 0.0.0.0/0 \
    --priority 0 \
    --target-tags quarantine
Create a firewall rule to deny all ingress TCP traffic to instances with the quarantine tag. Give this ingress rule a priority of 1, which lets you override it with another rule that allows SSH from a specified VM.
gcloud compute firewall-rules create quarantine-ingress-deny \
    --network NETWORK_NAME \
    --action deny \
    --direction ingress \
    --rules tcp \
    --source-ranges 0.0.0.0/0 \
    --priority 1 \
    --target-tags quarantine
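As an optional verification step, you can describe both rules to confirm that they exist and target the quarantine tag:
gcloud compute firewall-rules describe quarantine-egress-deny
gcloud compute firewall-rules describe quarantine-ingress-deny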
Remove the VM's external IP address
Removing the VM's external IP address breaks any existing network connections outside your VPC.
To remove the external address of a VM, perform the following steps:
Find and delete the access config that associates the external IP with the VM. First find the access config by describing the VM:
gcloud compute instances describe NODE_NAME \
    --zone COMPUTE_ZONE \
    --format="flattened([networkInterfaces])"
Look for the lines that contain name and natIP. They look like the following:
networkInterfaces[0].accessConfigs[0].name: ACCESS_CONFIG_NAME
networkInterfaces[0].accessConfigs[0].natIP: EXTERNAL_IP_ADDRESS
Find the value of natIP that matches the external IP you want to remove. Note the name of the access config.
To remove the external IP, run the following command:
gcloud compute instances delete-access-config NODE_NAME \
    --access-config-name "ACCESS_CONFIG_NAME"
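To verify that the external IP address has been removed, you can describe the VM again; the accessConfigs entries should no longer appear in the output:
gcloud compute instances describe NODE_NAME \
    --zone COMPUTE_ZONE \
    --format="flattened([networkInterfaces])"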
SSH to the host VM via an intermediate VM
After you remove the host VM's external IP, you cannot SSH to it from outside your VPC. Instead, you access it from another VM in the same network. For the rest of this section, we refer to this as the intermediate VM.
Prerequisites
- An intermediate VM with access to the subnetwork of the host VM. If you do not already have one, create a VM for this purpose (see the example after this list).
- The internal IP address of the intermediate VM.
- An SSH public key from the intermediate VM. To learn more, see Managing SSH Keys.
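If you need to create an intermediate VM, a minimal sketch looks like the following. The instance name, machine type, and subnet are placeholders; choose values that match your environment:
gcloud compute instances create quarantine-intermediate \
    --zone COMPUTE_ZONE \
    --subnet SUBNET_NAME \
    --machine-type e2-small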
Connecting to the host VM
Add the intermediate VM's public key to the host VM. For more information, see Adding and Removing SSH keys in the Compute Engine documentation.
Add a tag to the intermediate VM:
gcloud compute instances add-tags INTERMEDIATE_NODE_NAME \
    --zone COMPUTE_ZONE \
    --tags intermediate
Add an ingress allow rule to override the deny rule you added earlier. To add the rule, run the following command.
gcloud compute firewall-rules create quarantine-ingress-allow \
    --network NETWORK_NAME \
    --action allow \
    --direction ingress \
    --rules tcp:22 \
    --source-tags intermediate \
    --priority 0 \
    --target-tags quarantine
This rule allows incoming traffic on port 22 (SSH) from VMs in your network with the intermediate tag. Because it has a priority of 0, it overrides the deny rule.
Connect to the quarantined VM using its internal IP:
ssh -i KEY_PATH USER@QUARANTINED_VM_INTERNAL_IP
Replace the following:
- KEY_PATH: the path to your SSH private key.
- USER: your Google Cloud account's email address.
- QUARANTINED_VM_INTERNAL_IP: the internal IP address of the quarantined VM.
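Because the quarantined VM no longer has an external IP address, you typically run the ssh command above from the intermediate VM. Assuming you reach the intermediate VM with gcloud, a sketch is:
gcloud compute ssh INTERMEDIATE_NODE_NAME --zone COMPUTE_ZONE
From the intermediate VM, run the ssh command shown above to reach the quarantined VM.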
Redeploy a container
By redeploying your container, you start a fresh copy of the container and delete the compromised container.
You redeploy a container by deleting the Pod that hosts it. If the Pod is managed by a higher-level Kubernetes construct (for example, a Deployment or DaemonSet), deleting the Pod schedules a new Pod. This Pod runs new containers.
Redeploying makes sense when:
- You already know the cause of the vulnerability.
- You think it takes an attacker significant effort or time to compromise your container again.
- You think that the container might quickly get compromised again and you don't want to take it offline, so you plan to place it in a sandbox to limit the impact.
When redeploying the workload, if the possibility of another compromise is high, consider placing the workload in a sandbox environment such as GKE Sandbox. Sandboxing limits access to the host node kernel if the attacker compromises the container again.
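For example, one way to prepare a sandboxed environment is to create a node pool with GKE Sandbox enabled and schedule the redeployed workload onto it. This is a sketch with an illustrative node pool name; see the GKE Sandbox documentation for node image and version requirements:
gcloud container node-pools create sandbox-pool \
    --cluster CLUSTER_NAME \
    --zone COMPUTE_ZONE \
    --sandbox type=gvisor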
To redeploy a container in Kubernetes, delete the Pod that contains it:
kubectl delete pods POD_NAME --grace-period=10
If the containers in the deleted Pod continue to run, you can delete the workload.
To redeploy the container within a sandbox, follow the instructions in Harden workload isolation with GKE Sandbox.
Delete a workload
Deleting a workload, such as a Deployment or DaemonSet, causes all of its member Pods to be deleted. All containers inside those Pods stop running. Deleting a workload can make sense when:
- You want to stop an attack in progress.
- You are willing to take the workload offline.
- Stopping the attack immediately is more important than application uptime or forensic analysis.
To delete a workload, use kubectl delete CONTROLLER_TYPE.
For example, to delete a Deployment, run the following command:
kubectl delete deployments DEPLOYMENT
If deleting the workload doesn't delete all associated Pods or containers, you can manually delete the containers using the container runtime's CLI tool, typically docker. If your nodes run containerd, use crictl.
Docker
To stop a container using the Docker container runtime, you can use either docker stop or docker kill.
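These commands take a container ID or name (CONTAINER in the examples below). One way to find it is to list running containers on the node; on GKE nodes that use Docker, the generated container names typically include the Pod name, so a simple filter like the following often works (treat this as an assumption to verify on your nodes):
# List running containers and filter by the compromised Pod's name
docker ps | grep POD_NAME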
docker stop
stops the container by sending a SIGTERM
signal to the root
process, and waits 10 seconds for the process to exit by default. If the
process hasn't exited in that time period, it then sends a SIGKILL
signal.
You can specify this grace period with the --time
option.
docker stop --time TIME_IN_SECONDS CONTAINER
docker kill
is the fastest method to stop a container. It sends the
SIGKILL
signal immediately.
docker kill CONTAINER
You can also stop and remove a container in one command with docker rm -f
:
docker rm -f CONTAINER
containerd
If you use the containerd
runtime in GKE, you stop or
remove containers with crictl
.
To stop a container in containerd
, run the following command:
crictl stop CONTAINER
To remove a container in containerd
, run the following command:
crictl rm -f CONTAINER
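To find the CONTAINER ID to pass to these commands, you can look up the Pod sandbox first and then list its containers. The flags shown are from crictl's help; verify them against the crictl version on your node, and note that POD_SANDBOX_ID is a placeholder for the ID returned by the first command:
# Find the Pod sandbox ID for the compromised Pod
crictl pods --name POD_NAME
# List the containers running in that Pod sandbox
crictl ps --pod POD_SANDBOX_ID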
Deleting the host VM
If you are unable to delete or remove the container, you can delete the virtual machine that hosts the affected container.
If the Pod is still visible, you can find the name of the host VM with the following command:
kubectl get pods --all-namespaces \
-o=custom-columns=POD_NAME:.metadata.name,INSTANCE_NAME:.spec.nodeName \
--field-selector=metadata.name=POD_NAME
To delete the host VM, run the following gcloud
command:
gcloud compute instance-groups managed delete-instances INSTANCE_GROUP_NAME \
--instances=NODE_NAME
Abandoning the instance from the Managed Instance Group reduces the size of the group by one VM. You can manually add one instance back to the group with the following command:
gcloud compute instance-groups managed resize INSTANCE_GROUP_NAME \
--size=SIZE
What's next
- Performing forensics on containers
- Hardening your cluster's security
- Forensic analysis for GKE applications