Mitigating security incidents

This document describes common mitigations and responses to potential security incidents on your Google Kubernetes Engine (GKE) clusters and containers.

The suggestions in Hardening your cluster's security can improve the security of your GKE workloads. Security incidents, however, can occur even when measures to protect your workloads are in place.

Detecting incidents

To detect potential incidents, we recommend you set up a process that collects and monitors your workload's logs. Then, set up alerts based on abnormal events detected from logs. Alerts notify your security team when something unusual is detected. Your security team can then review the potential incident.

Generating alerts from logs

You can customize alerts based on specific metrics or actions. For example, high CPU usage on your GKE nodes may indicate that they have been compromised for cryptomining, so alerting on it can surface the problem early.

Alerts should be generated where you aggregate your logs and metrics. For example, you can use GKE Audit Logging in combination with logs-based alerting in Cloud Logging.

To learn more about security-relevant queries, see the Audit Logging documentation.
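
As a hedged illustration of a security-relevant query, the following sketch searches recent audit log entries for requests that open an exec session in a Pod. The filter fields and the method name `io.k8s.core.v1.pods.exec.create` are assumptions about how your audit entries are structured, so check them against a real entry before relying on them. The helper names `pod_exec_filter` and `find_pod_exec_events` are hypothetical.

```shell
# Sketch: list recent audit log entries for "exec into a Pod" requests.
# Assumption: entries use resource.type "k8s_cluster" and the method name below.

# Build the Cloud Logging filter (hypothetical helper).
pod_exec_filter() {
  printf '%s' 'resource.type="k8s_cluster" AND protoPayload.methodName="io.k8s.core.v1.pods.exec.create"'
}

# Read matching entries from the last hour (hypothetical helper).
find_pod_exec_events() {
  project_id="$1"
  gcloud logging read "$(pod_exec_filter)" \
    --project "$project_id" \
    --freshness 1h \
    --limit 20 \
    --format json
}

# Usage: find_pod_exec_events PROJECT_ID
```

A filter like this one could also serve as the starting point for a logs-based alert, once you have confirmed that it matches the events you care about.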

Responding to a security incident

After you have been alerted to an incident, take action. Fix the vulnerability if you can. If you do not know the root cause of the vulnerability or do not have a fix ready, apply mitigations.

The mitigations you might take depend on the severity of the incident and your certainty that you have identified the issue.

This guide covers actions you can take after you detect an incident on a workload running on GKE. You could, in increasing order of severity:

Snapshot the host VM's disk. A snapshot lets you perform some forensics on the VM state at the time of the anomaly after the workload has been redeployed or deleted.
Inspect the VM while the workload continues to run. Connecting to the host VM or workload container can provide information about the attacker's actions. We recommend you reduce access before inspecting the live VM.
Redeploy a container. Redeploying ends currently running processes in the affected container and restarts them.
Delete a workload. Deleting the workload ends currently running processes in the affected container without a restart.

These mitigations are described in the following sections.

Before you begin

The methods used in this topic use the following information:

The name of the Pods you think have been compromised, or POD_NAME.
The name of the host VM running the container or Pods, or NODE_NAME.

Also, before taking any of the actions, consider if there will be a negative reaction from the attacker if they are discovered. The attacker may decide to delete data or destroy workloads. If the risk is too high, consider more drastic mitigations such as deleting a workload before performing further investigation.

Snapshot the VM's disk

Creating a snapshot of the VM's disk allows forensic investigation after the workload has been redeployed or deleted. Snapshots can be created while disks are attached to running instances.

To snapshot your persistent disk, first find the disks attached to your VM. Run the following command and look at the source field:
```
gcloud compute instances describe NODE_NAME --zone COMPUTE_ZONE \
    --format="flattened([disks])"
```
Look for the lines that contain disks[NUMBER].source. The output is similar to the following:
```
disks[0].source: https://www.googleapis.com/compute/v1/projects/PROJECT_NAME/zones/COMPUTE_ZONE/disks/DISK_NAME
```
The disk name is the portion of the source name after the final slash. For example, the disk name might be gke-cluster-pool-1-abcdeffff-zgt8.
To complete the snapshot, run the following command:
```
gcloud compute disks snapshot DISK_NAME
```

For more information, see Creating persistent disk snapshots in the Compute Engine documentation.
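
The "portion after the final slash" step can be scripted. The following is a minimal sketch, assuming the flattened output has the shape shown above; the helper name `disk_names_from_describe` is hypothetical.

```shell
# Sketch: extract disk names from the flattened describe output.
# Reads the output of the describe command on stdin and prints one
# disk name per line.
disk_names_from_describe() {
  # Keep only the disks[N].source lines, then drop everything up to
  # and including the final slash.
  sed -n 's|^disks\[[0-9][0-9]*]\.source:.*/\([^/[:space:]]*\)[[:space:]]*$|\1|p'
}

# Usage:
#   gcloud compute instances describe NODE_NAME --zone COMPUTE_ZONE \
#       --format="flattened([disks])" | disk_names_from_describe
```

Each printed name can then be passed to the snapshot command as DISK_NAME.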

Inspect the VM while the workload continues to run

Consider what access an attacker may have before taking action. If you suspect a container has been compromised and are concerned about alerting the attacker, you can connect to the container and inspect it. Inspecting is useful for quick investigation before taking more disruptive actions. Inspecting is also the least disruptive approach to the workload, but it doesn't stop the incident.

Alternatively, to avoid logging into a machine with a privileged credential, you can analyze your workloads by setting up live forensics (such as GRR Rapid Response), on-node agents, or network filtering.

Reduce access before inspecting the live VM

By cordoning, draining, and limiting network access to the VM hosting a compromised container, you can partially isolate the compromised container from the rest of your cluster. Limiting access to the VM reduces risk but does not prevent an attacker from moving laterally in your environment if they take advantage of a critical vulnerability.

Cordon the node and drain the other workloads from it

Cordoning and draining a node moves workloads colocated with the compromised container to other VMs in your cluster. Cordoning and draining reduces an attacker's ability to impact other workloads on the same node. It does not necessarily prevent them from inspecting a workload's persistent state (for example, by inspecting container image contents).

Use kubectl to cordon the node and ensure that no other Pods are scheduled on it:
```
kubectl cordon NODE_NAME
```
Caution: Draining evicts other workloads from the same node, which might cause downtime. Using kubectl drain will respect PodDisruptionBudgets, but it assumes Pods are configured to be automatically re-created on other nodes after eviction.

After cordoning the node, drain the node of other Pods.
Label the Pod that you are quarantining:
```
kubectl label pods POD_NAME quarantine=true
```
Replace POD_NAME with the name of the Pod that you want to quarantine.
Drain the node of Pods that are not labeled with quarantine:
```
kubectl drain NODE_NAME --pod-selector='!quarantine'
```
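
The cordon, label, and drain steps can be run as one sequence. The following is a minimal sketch that uses the same commands as above. The `DRY_RUN` variable and the helper names `run` and `quarantine_node` are assumptions of this sketch, not kubectl features; by default the commands are only printed.

```shell
# Sketch: run the cordon, label, and drain steps in order.
# DRY_RUN defaults to 1 (an assumption of this sketch), so commands are
# printed rather than executed until you set DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

quarantine_node() {
  node_name="$1"
  pod_name="$2"
  # Stop at the first failing step so that a failed cordon or label
  # is never followed by a drain.
  run kubectl cordon "$node_name" &&
    run kubectl label pods "$pod_name" quarantine=true &&
    run kubectl drain "$node_name" --pod-selector='!quarantine'
}

# Usage: quarantine_node NODE_NAME POD_NAME            (prints the commands)
#        DRY_RUN=0 quarantine_node NODE_NAME POD_NAME  (runs them)
```

Labeling before draining matters: if the drain ran first, the compromised Pod would be evicted along with the others.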

Restrict network access to the node

We recommend blocking both internal and external traffic from reaching the host VM. Then, allow inbound connections only from a specific VM in your network or VPC so that you can connect to the quarantined VM.

The first step is to abandon the VM from the Managed Instance Group that owns it. Abandoning the VM prevents the node from being marked unhealthy and auto-repaired (re-created) before your investigation is complete.

To abandon the VM, run the following command:

```
gcloud compute instance-groups managed abandon-instances INSTANCE_GROUP_NAME \
    --instances=NODE_NAME
```

Firewall the VM

Creating a firewall between the affected container and other workloads in the same network helps prevent an attacker from moving into other parts of your environment while you conduct further analysis. Since you already drained the VM of other containers, this only affects the quarantined container.

The following instructions for firewalling the VM prevent:

New outbound connections to other VMs in your cluster using an egress rule.
Inbound connections to the compromised VM using an ingress rule.

To firewall the VM off from your other instances, follow these steps for the node that hosts the Pod you want to quarantine:

Tag the instance so you can apply a new firewall rule.

Note: Check that the tag does not conflict with other VMs before applying it.
```
gcloud compute instances add-tags NODE_NAME \
    --zone COMPUTE_ZONE \
    --tags quarantine
```

Create a firewall rule to deny all egress TCP traffic from instances with the quarantine tag:

```
gcloud compute firewall-rules create quarantine-egress-deny \
    --network NETWORK_NAME \
    --action deny \
    --direction egress \
    --rules tcp \
    --destination-ranges 0.0.0.0/0 \
    --priority 0 \
    --target-tags quarantine
```

Create a firewall rule to deny all ingress TCP traffic to instances with the quarantine tag. Give this ingress rule a priority of 1, which lets you override it with another rule that allows SSH from a specified VM.

```
gcloud compute firewall-rules create quarantine-ingress-deny \
    --network NETWORK_NAME \
    --action deny \
    --direction ingress \
    --rules tcp \
    --source-ranges 0.0.0.0/0 \
    --priority 1 \
    --target-tags quarantine
```

Remove the VM's external IP address

Removing the VM's external IP address breaks any existing network connections outside your VPC.

To remove the external address of a VM, perform the following steps:

Find and delete the access config that associates the external IP with the VM. First find the access config by describing the VM:

```
gcloud compute instances describe NODE_NAME \
    --zone COMPUTE_ZONE --format="flattened([networkInterfaces])"
```

Look for the lines that contain name and natIP. They look like the following:

```
networkInterfaces[0].accessConfigs[0].name:              ACCESS_CONFIG_NAME
networkInterfaces[0].accessConfigs[0].natIP:             EXTERNAL_IP_ADDRESS
```

Find the value of natIP that matches the external IP you want to remove. Note the name of the access config.

To remove the external IP, run the following command:

```
gcloud compute instances delete-access-config NODE_NAME \
    --access-config-name "ACCESS_CONFIG_NAME"
```
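
Matching the natIP value to its access config name can be done by hand or with a small filter. The following is a minimal sketch, assuming the flattened output has the shape shown above; the helper name `access_config_name_for_ip` is hypothetical.

```shell
# Sketch: find the access config name whose natIP matches a given external IP.
# Reads the flattened describe output on stdin.
access_config_name_for_ip() {
  awk -v ip="$1" '
    # Split each line into its "key:" and its value; the value may contain spaces.
    {
      key = $1
      value = $0
      sub(/^[^:]*:[ \t]*/, "", value)
      sub(/[ \t]+$/, "", value)
    }
    # Remember every access config name, indexed by its key prefix.
    key ~ /accessConfigs\[[0-9]+\]\.name:$/ {
      sub(/\.name:$/, "", key); names[key] = value
    }
    # Remember the prefix of the access config that holds the requested IP.
    key ~ /accessConfigs\[[0-9]+\]\.natIP:$/ {
      sub(/\.natIP:$/, "", key); if (value == ip) wanted = key
    }
    END { if (wanted != "") print names[wanted] }
  '
}

# Usage:
#   gcloud compute instances describe NODE_NAME --zone COMPUTE_ZONE \
#       --format="flattened([networkInterfaces])" |
#       access_config_name_for_ip EXTERNAL_IP_ADDRESS
```

The function prints nothing when no access config holds the given address, which is also what you should see after the external IP has been removed.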

SSH to the host VM via an intermediate VM

After you remove the host VM's external IP, you cannot SSH to it from outside your VPC. Instead, you access it from another VM in the same network. For the rest of this section, we refer to this VM as the intermediate VM.

Prerequisites

An intermediate VM with access to the subnetwork of the host VM. If you do not already have one, create a VM for this purpose.
The internal IP address of the intermediate VM.
An SSH public key from the intermediate VM. To learn more, see Managing SSH Keys.

Connecting to the host VM

Add the intermediate VM's public key to the host VM. For more information, see Adding and Removing SSH keys in the Compute Engine documentation.
Add a tag to the intermediate VM.

Note: Check that the tag does not conflict with other VMs in your network before applying it.
```
gcloud compute instances add-tags INTERMEDIATE_NODE_NAME \
  --zone COMPUTE_ZONE \
  --tags intermediate
```
Add an ingress allow rule to override the deny rule you added earlier. To add the rule, run the following command.
```
gcloud compute firewall-rules create quarantine-ingress-allow \
    --network NETWORK_NAME \
    --action allow \
    --direction ingress \
    --rules tcp:22 \
    --source-tags intermediate \
    --priority 0 \
    --target-tags quarantine
```
This rule allows incoming traffic on port 22 (SSH) from VMs in your network with the intermediate tag. Because it has a priority of 0, it overrides the ingress deny rule, which has a priority of 1.
Connect to the quarantined VM using its internal IP:
```
ssh -i KEY_PATH USER@QUARANTINED_VM_INTERNAL_IP
```
Replace the following:
- KEY_PATH: the path to your SSH private key.
- USER: your Google Cloud account's email address.
- QUARANTINED_VM_INTERNAL_IP: the internal IP address.
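
The SSH connection succeeds because of how firewall rule priorities interact: among the rules that match a connection, the one with the lowest priority number applies, and on a tie a deny rule takes precedence over an allow rule. The following toy model sketches that selection. It is only an illustration, and the helper name `winning_rule` and the input format are hypothetical; it does not query your actual firewall configuration.

```shell
# Sketch: model which rule applies when several rules match the same traffic.
# Input (stdin): one matching rule per line as "PRIORITY ACTION NAME".
# The lowest priority number wins; on a tie, deny sorts before allow.
winning_rule() {
  sort -k1,1n -k2,2r | head -n 1
}

# Usage, for SSH traffic from the intermediate VM to the quarantined VM:
#   printf '%s\n' \
#     '1 deny quarantine-ingress-deny' \
#     '0 allow quarantine-ingress-allow' | winning_rule
```

For any other inbound TCP traffic, only the deny rule matches, so the traffic stays blocked.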

Redeploy a container

By redeploying your container, you start a fresh copy of the container and delete the compromised container.

You redeploy a container by deleting the Pod that hosts it. If the Pod is managed by a higher-level Kubernetes construct (for example, a Deployment or DaemonSet), deleting the Pod schedules a new Pod. This Pod runs new containers.

Redeploying makes sense when:

You already know the cause of the vulnerability.
You think it takes an attacker significant effort or time to compromise your container again.
You think that the container might quickly get compromised again and you don't want to take it offline, so you plan to place it in a sandbox to limit the impact.

When redeploying the workload, if the possibility of another compromise is high, consider placing the workload in a sandbox environment such as GKE Sandbox. Sandboxing limits access to the host node kernel if the attacker compromises the container again.

To redeploy a container in Kubernetes, delete the Pod that contains it:

```
kubectl delete pods POD_NAME --grace-period=10
```
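
Whether deleting the Pod results in a redeploy depends on whether a controller owns it. The following is a minimal sketch that checks the Pod's owner references before you delete it; the helper names `pod_owner_kinds` and `explain_pod_delete` are hypothetical, and the Pod is assumed to be in your current namespace.

```shell
# Sketch: report whether deleting a Pod is likely to produce a replacement.

# Prints the kinds of the Pod's owners, for example "ReplicaSet" for a Pod
# created by a Deployment, or "DaemonSet".
pod_owner_kinds() {
  kubectl get pod "$1" -o jsonpath='{.metadata.ownerReferences[*].kind}'
}

explain_pod_delete() {
  owners="$(pod_owner_kinds "$1")"
  if [ -n "$owners" ]; then
    echo "Pod $1 is owned by: $owners. Its controller should create a replacement."
  else
    echo "Pod $1 has no owner. Deleting it removes the workload without a restart."
  fi
}

# Usage: explain_pod_delete POD_NAME
```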

If the containers in the deleted Pod continue to run, you can delete the workload.

To redeploy the container within a sandbox, follow the instructions in Harden workload isolation with GKE Sandbox.

Delete a workload

Deleting a workload, such as a Deployment or DaemonSet, causes all of its member Pods to be deleted. All containers inside those Pods stop running. Deleting a workload can make sense when:

You want to stop an attack in progress.
You are willing to take the workload offline.
Stopping the attack immediately is more important than application uptime or forensic analysis.

To delete a workload, use kubectl delete CONTROLLER_TYPE. For example, to delete a Deployment, run the following command:

```
kubectl delete deployments DEPLOYMENT
```

If deleting the workload doesn't delete all associated Pods or containers, you can manually delete the containers using the container runtime's CLI tool, typically docker. If your nodes run containerd, use crictl.

Docker

To stop a container using the Docker container runtime, you can use either docker stop or docker kill.

docker stop stops the container by sending a SIGTERM signal to the root process, and waits 10 seconds for the process to exit by default. If the process hasn't exited in that time period, it then sends a SIGKILL signal. You can specify this grace period with the --time option.

```
docker stop --time TIME_IN_SECONDS CONTAINER
```

docker kill is the fastest method to stop a container. It sends the SIGKILL signal immediately.

```
docker kill CONTAINER
```

You can also stop and remove a container in one command with docker rm -f:

```
docker rm -f CONTAINER
```

containerd

If you use the containerd runtime in GKE, you stop or remove containers with crictl.

To stop a container in containerd, run the following command:

```
crictl stop CONTAINER
```

To remove a container in containerd, run the following command:

```
crictl rm -f CONTAINER
```
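
The commands above need a container ID. The following is a minimal sketch of one way to look up the containers that belong to a Pod by name, run on the node itself. The helper name `containers_for_pod` is hypothetical, and the sketch assumes crictl is configured to talk to the node's containerd socket.

```shell
# Sketch: list the IDs of all containers that belong to a Pod, by Pod name.
# Run on the node; crictl usually requires root privileges.
containers_for_pod() {
  pod_name="$1"
  # --name takes a pattern, so this may match more than one Pod sandbox.
  for sandbox_id in $(crictl pods --name "$pod_name" -q); do
    # -a includes containers that have already exited.
    crictl ps -a --pod "$sandbox_id" -q
  done
}

# Usage: containers_for_pod POD_NAME
```

Review the printed IDs before passing any of them to crictl stop or crictl rm, because the name pattern can match Pods other than the one you are quarantining.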

Deleting the host VM

If you are unable to delete or remove the container, you can delete the virtual machine that hosts the affected container.

If the Pod is still visible, you can find the name of the host VM with the following command:

```
kubectl get pods --all-namespaces \
  -o=custom-columns=POD_NAME:.metadata.name,INSTANCE_NAME:.spec.nodeName \
  --field-selector=metadata.name=POD_NAME
```

To delete the host VM, run the following gcloud command:

```
gcloud compute instance-groups managed delete-instances INSTANCE_GROUP_NAME \
    --instances=NODE_NAME
```

Deleting an instance from the Managed Instance Group, like abandoning one, reduces the size of the group by one VM. You can manually add one instance back to the group with the following command:

```
gcloud compute instance-groups managed resize INSTANCE_GROUP_NAME \
    --size=SIZE
```
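
The SIZE value is the group's current size plus one. The following is a minimal sketch that reads the current target size and resizes accordingly, assuming a zonal Managed Instance Group; the helper name `grow_group_by_one` is hypothetical.

```shell
# Sketch: grow a zonal managed instance group by one VM, based on its
# current target size.
grow_group_by_one() {
  group="$1"
  zone="$2"
  current="$(gcloud compute instance-groups managed describe "$group" \
    --zone "$zone" --format="value(targetSize)")" || return 1
  # Refuse to continue if the size is missing or not a whole number.
  case "$current" in
    ''|*[!0-9]*) echo "unexpected target size: $current" >&2; return 1 ;;
  esac
  gcloud compute instance-groups managed resize "$group" \
    --zone "$zone" --size "$((current + 1))"
}

# Usage: grow_group_by_one INSTANCE_GROUP_NAME COMPUTE_ZONE
```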