Mitigating security incidents

This document describes common mitigations and responses to potential security incidents on your Google Kubernetes Engine (GKE) clusters and containers.

The suggestions in Hardening your cluster's security can improve the security of your GKE workloads. Security incidents, however, can occur even when measures to protect your workloads are in place.

Detecting incidents

To detect potential incidents, we recommend you set up a process that collects and monitors your workload's logs. Then, set up alerts based on abnormal events detected from logs. Alerts notify your security team when something unusual is detected. Your security team can then review the potential incident.

Generating alerts from logs

You can customize alerts based on specific metrics or actions. For example, high CPU usage on your GKE nodes may indicate that they have been compromised for cryptomining, so it is a useful condition to alert on.

Generate alerts where you aggregate your logs and metrics. For example, you can use GKE's Audit Logging in combination with logs-based alerting in Cloud Logging.

To learn more about setting up alerts based on logs from GKE, see Security Controls and forensic analysis for GKE apps. For example security-relevant queries, see the Audit Logging documentation.
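
As a rough illustration of a security-relevant query, the following sketch builds a Cloud Logging filter for exec sessions opened into Pods, as recorded in GKE audit logs. The helper name exec_audit_filter, the cluster name my-cluster, and the one-day window are hypothetical, and the methodName value is an assumption you should confirm against the Audit Logging documentation before relying on it.

```shell
# Hypothetical helper: print a Cloud Logging filter that matches audit log
# entries for exec sessions opened into Pods of one cluster.
# $1 = cluster name
exec_audit_filter() {
  printf 'resource.type="k8s_cluster" AND resource.labels.cluster_name="%s" AND protoPayload.methodName="io.k8s.core.v1.pods.exec.create"' "$1"
}

# Example usage (requires gcloud and permission to read logs):
#   gcloud logging read "$(exec_audit_filter my-cluster)" --freshness=1d --limit=20
```

You can use the same filter text as the basis of a logs-based alert, so that your security team is notified instead of having to run the query by hand.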

Responding to a security incident

After you have been alerted to an incident, take action. Fix the vulnerability if you can. If you do not know the root cause of the vulnerability or do not have a fix ready, apply mitigations.

The mitigations you might take depend on the severity of the incident and your certainty that you have identified the issue.

This guide covers actions you can take after you detect an incident on a workload running on GKE. You could, in increasing order of severity:

  • Snapshot the host VM's disk. A snapshot lets you perform some forensics on the VM state at the time of the anomaly after the workload has been redeployed or deleted.
  • Inspect the VM while the workload continues to run. Connecting to the host VM or workload container can provide information about the attacker's actions. We recommend you reduce access before inspecting the live VM.
  • Redeploy a container. Redeploying kills currently running processes in the affected container and restarts them.
  • Delete a workload. Deleting the workload kills currently running processes in the affected container without a restart.

These mitigations are described in the following sections.

Before you begin

The methods used in this topic use the following information:

  • The name of the Pods you think have been compromised, or POD_NAME.
  • The name of the host VM running the container or Pods, or NODE_NAME.

Also, before taking any of the actions, consider if there will be a negative reaction from the attacker if they are discovered. The attacker may decide to delete data or destroy workloads. If the risk is too high, consider more drastic mitigations such as deleting a workload before performing further investigation.

Snapshot the VM's disk

Creating a snapshot of the VM's disk allows forensic investigation after the workload has been redeployed or deleted. Snapshots can be created while disks are attached to running instances.

To snapshot your persistent disk, first find the disks attached to your VM. Run the following command and look at the "source" field:

gcloud compute instances describe [NODE_NAME] --zone [ZONE_NAME] --format="flattened([disks])"

Look for the lines that contain disks[NUMBER].source. An example output follows.

disks[0].source: https://www.googleapis.com/compute/v1/projects/[PROJECT_NAME]/zones/[ZONE]/disks/[DISK_NAME]

The disk name is the portion of the source name after the final slash. For example, the disk name might be gke-cluster-pool-1-abcdeffff-zgt8.

To complete the snapshot, run the following command:

gcloud compute disks snapshot [DISK_NAME]

For more information, see Creating persistent disk snapshots in the Compute Engine documentation.
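
The steps above can be sketched as a small script. This is a minimal sketch, not a definitive implementation: the function names are hypothetical, it assumes the flattened output looks like the example shown earlier, and it assumes zonal disks in the same zone as the node.

```shell
# Print the disk name: the part of a disk source URL after the final slash.
disk_name_from_source() {
  printf '%s\n' "${1##*/}"
}

# Read `--format="flattened([disks])"` output on stdin and print one disk
# name per line, taken from the disks[NUMBER].source lines.
disk_names_from_describe() {
  awk '/^[ \t]*disks\[[0-9]+\]\.source:/ { print $2 }' |
    while read -r source; do
      disk_name_from_source "$source"
    done
}

# Snapshot every disk attached to a node.
# $1 = node name, $2 = zone
snapshot_node_disks() {
  gcloud compute instances describe "$1" --zone "$2" \
      --format="flattened([disks])" |
    disk_names_from_describe |
    while read -r disk; do
      # </dev/null keeps gcloud from consuming the list of remaining disks.
      gcloud compute disks snapshot "$disk" --zone "$2" </dev/null
    done
}
```

For example, snapshot_node_disks [NODE_NAME] [ZONE] would snapshot each disk attached to the node in turn.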

Inspect the VM while the workload continues to run

Before taking action, consider what access an attacker may have. If you suspect a container has been compromised and are concerned about alerting the attacker, you can connect to the container and inspect it. Inspecting is useful for a quick investigation before you take more disruptive actions. It is also the least disruptive approach to the workload, but it doesn't stop the incident.

Alternatively, to avoid logging into a machine with a privileged credential, you can analyze your workloads by setting up live forensics (such as GRR Rapid Response), on-node agents, or network filtering. For more information on suggested forensics tools, see Security controls and forensic analysis for GKE apps.

Reduce access before inspecting the live VM

By cordoning, draining, and limiting network access to the VM hosting a compromised container, you can partially isolate the compromised container from the rest of your cluster. Limiting access to the VM reduces risk but does not prevent an attacker from moving laterally in your environment if they take advantage of a critical vulnerability.

Cordon the node and drain the other workloads from it

Cordoning and draining a node moves workloads colocated with the compromised container to other VMs in your cluster. Cordoning and draining reduces an attacker's ability to impact other workloads on the same node. It does not necessarily prevent them from inspecting a workload's persistent state (for example, by inspecting container image contents).

  1. Use kubectl to cordon the node and ensure that no other pods are scheduled on it:

    kubectl cordon [NODE_NAME]

    After cordoning the node, drain the node of other pods.

  2. Label the pod you're quarantining:

    kubectl label pods [POD_NAME] quarantine=true

    Where POD_NAME is the name of the pod you want to quarantine.

  3. Drain the node of pods that are not labeled with quarantine:

    kubectl drain [NODE_NAME] --pod-selector='!quarantine'
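
The three steps above can be chained so that a failure in one step stops the sequence. This is a minimal sketch: the function names are hypothetical, and the namespace argument is an assumption (the commands above rely on your current namespace). The second helper lists what is still scheduled on the node so you can confirm the result.

```shell
# Quarantine one Pod on its node: cordon the node, label the Pod, then drain
# every Pod that does not carry the quarantine label.
# $1 = node name, $2 = Pod name, $3 = namespace of the Pod (assumed argument)
quarantine_pod() {
  kubectl cordon "$1" &&
  kubectl label pods "$2" --namespace "$3" quarantine=true &&
  kubectl drain "$1" --pod-selector='!quarantine'
}

# List the Pods that are still scheduled on a node.
# $1 = node name
pods_on_node() {
  kubectl get pods --all-namespaces --field-selector spec.nodeName="$1" -o wide
}
```

If the drain step refuses to proceed, for example because the node runs Pods managed by a DaemonSet, review the error before you add any flags that override the check.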

Restrict network access to the node

We recommend blocking both internal and external traffic from accessing the host VM. Next, allow inbound connections from a specific VM in your network or VPC to connect to the quarantined VM.

The first step is to abandon the VM from the Managed Instance Group that owns it. Abandoning the VM prevents the node from being marked unhealthy and auto-repaired (re-created) before your investigation is complete.

To abandon the VM, run the following command:

gcloud compute instance-groups managed abandon-instances [INSTANCE_GROUP_NAME] --instances=[NODE_NAME]

Firewall the VM

Creating a firewall between the affected container and other workloads in the same network helps prevent an attacker from moving into other parts of your environment while you conduct further analysis. Since you already drained the VM of other containers, this only affects the quarantined container.

The following instructions on firewalling the VM prevent:

  • New outbound connections to other VMs in your cluster using an egress rule.
  • Inbound connections to the compromised VM using an ingress rule.

To firewall the VM off from your other instances, follow these steps for the node that hosts the Pod you want to quarantine:

  1. Tag the instance so you can apply a new firewall rule.

    gcloud compute instances add-tags [NODE_NAME] \
        --zone [ZONE] \
        --tags quarantine
  2. Create a firewall rule to deny all egress TCP traffic from instances with the quarantine tag:

    gcloud compute firewall-rules create quarantine-egress-deny \
        --network [NETWORK_NAME] \
        --action deny \
        --direction egress \
        --rules tcp \
        --destination-ranges 0.0.0.0/0 \
        --priority 0 \
        --target-tags quarantine
  3. Create a firewall rule to deny all ingress TCP traffic to instances with the quarantine tag. Give this ingress rule a priority of 1, which lets you override it with another rule that allows SSH from a specified VM.

    gcloud compute firewall-rules create quarantine-ingress-deny \
        --network [NETWORK_NAME] \
        --action deny \
        --direction ingress \
        --rules tcp \
        --source-ranges 0.0.0.0/0 \
        --priority 1 \
        --target-tags quarantine

Remove the VM's external IP address

Removing the VM's external IP address breaks any existing network connections outside your VPC.

To remove the external address of a VM, perform the following steps.

  1. Find and delete the access config that associates the external IP with the VM. First find the access config by describing the VM:

    gcloud compute instances describe [NODE_NAME] --zone [ZONE] --format="flattened([networkInterfaces])"

    Look for the lines that contain name and natIP. They look like the following:

    networkInterfaces[0].accessConfigs[0].name:                [ACCESS_CONFIG_NAME]
    networkInterfaces[0].accessConfigs[0].natIP:               [EXTERNAL_IP_ADDRESS]
    
  2. Find the value of natIP that matches the external IP you want to remove. Note the name of the access config.

  3. To remove the external IP, run the following command.

    gcloud compute instances delete-access-config [NODE_NAME] --access-config-name "[ACCESS_CONFIG_NAME]"
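
Matching the natIP to its access config name by eye is error-prone when a VM has several interfaces. The following sketch does the lookup from the flattened output. The function name access_config_for_ip is hypothetical, and the parsing assumes the output format shown in the example above.

```shell
# Read `--format="flattened([networkInterfaces])"` output on stdin and print
# the name of the access config whose natIP equals the given address.
# $1 = external IP address to look up
access_config_for_ip() {
  awk -v ip="$1" '
    /\.accessConfigs\[[0-9]+\]\.(name|natIP):/ {
      key = $1                          # for example ...accessConfigs[0].name:
      value = $0
      sub(/^[^:]*:[ \t]*/, "", value)   # keep the text after "key:"
      prefix = key
      sub(/\.(name|natIP):$/, "", prefix)
      if (key ~ /\.name:$/) name[prefix] = value
      else                  nat[prefix] = value
    }
    END {
      for (p in nat) if (nat[p] == ip && (p in name)) print name[p]
    }'
}

# Example usage:
#   gcloud compute instances describe [NODE_NAME] --zone [ZONE] \
#       --format="flattened([networkInterfaces])" |
#     access_config_for_ip [EXTERNAL_IP_ADDRESS]
```

The value after the colon is kept whole because access config names can contain spaces.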

SSH to the host VM via an intermediate VM

After you remove the host VM's external IP, you cannot SSH to it from outside your VPC. Instead, you access it from another VM in the same network. For the rest of this section, we refer to this VM as the intermediate VM.

Prerequisites

  • An intermediate VM with access to the subnetwork of the host VM. If you do not already have one, create a VM for this purpose.
  • The internal IP address of the intermediate VM.
  • An SSH public key from the intermediate VM. To learn more, see Managing SSH Keys.

Connecting to the host VM

  1. Add the intermediate VM's public key to the host VM. For more information, see Adding and Removing SSH keys in the Compute Engine documentation.
  2. Add a tag to the intermediate VM.

    gcloud compute instances add-tags [INTERMEDIATE_NODE_NAME] \
      --zone [ZONE] \
      --tags intermediate
  3. Add an ingress allow rule to override the deny rule you added earlier. To add the rule, run the following command.

    gcloud compute firewall-rules create quarantine-ingress-allow \
        --network [NETWORK_NAME] \
        --action allow \
        --direction ingress \
        --rules tcp:22 \
        --source-tags intermediate \
        --priority 0 \
        --target-tags quarantine

    This rule allows incoming traffic on port 22 (SSH) from VMs in your network with the intermediate tag. Because its priority of 0 takes precedence over the deny rule's priority of 1, it overrides the deny rule for this traffic.

  4. Connect to the quarantined VM using its internal IP:

    ssh -i [KEY_PATH] [USER]@[QUARANTINED_VM_INTERNAL_IP]

    Where KEY_PATH is the path to your SSH private key, and USER is your Google Cloud account's email address.
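
The override in step 3 works because firewall rules with a lower priority number take precedence, and a deny rule takes precedence over an allow rule of the same priority. The following toy model, which is only an illustration and not how the firewall is implemented, picks the winning rule among rules that match the same traffic. The function name effective_rule and the input format are hypothetical.

```shell
# Read "PRIORITY ACTION NAME" lines on stdin, one per rule that matches the
# same traffic, and print the rule that takes effect.
# The lowest priority number wins; on a tie, deny beats allow.
effective_rule() {
  # Sort by priority ascending, then by action descending ("deny" > "allow").
  sort -k1,1n -k2,2r | head -n 1
}

# Example usage:
#   printf '1 deny quarantine-ingress-deny\n0 allow quarantine-ingress-allow\n' |
#     effective_rule
```

For SSH traffic from the intermediate VM, the allow rule with priority 0 wins. All other inbound TCP traffic matches only the deny rule and stays blocked.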

Redeploy a container

By redeploying your container, you start a fresh copy of the container and delete the compromised container.

You redeploy a container by deleting the Pod that hosts it. If the Pod is managed by a higher-level Kubernetes construct (for example, a Deployment or DaemonSet), deleting the Pod schedules a new Pod. This Pod runs new containers.

Redeploying makes sense when:

  • You already know the cause of the vulnerability.
  • You think it takes an attacker significant effort or time to compromise your container again.

To redeploy a container in Kubernetes, delete the Pod that contains it.

kubectl delete pods [POD] --grace-period=10

If the containers in the deleted Pod continue to run, you can delete the workload.
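
Before you delete a Pod, it can help to check whether a controller will schedule a replacement. The following sketch reads the Pod's owner references. The function names and the namespace argument are hypothetical, and the list of controller kinds is an assumption that covers only common cases. Note that Pods created by a Deployment are owned by a ReplicaSet.

```shell
# Print the kinds of the objects that own a Pod (empty output if none).
# $1 = Pod name, $2 = namespace (assumed argument)
pod_owner_kinds() {
  kubectl get pod "$1" --namespace "$2" \
      -o jsonpath='{.metadata.ownerReferences[*].kind}'
}

# Succeed if the given owner kinds include a controller that normally
# replaces a deleted Pod. $1 = output of pod_owner_kinds
is_recreated_on_delete() {
  case "$1" in
    *ReplicaSet*|*DaemonSet*|*StatefulSet*|*ReplicationController*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage:
#   if is_recreated_on_delete "$(pod_owner_kinds [POD] default)"; then
#     echo "A replacement Pod is expected after deletion."
#   fi
```

If the check fails, deleting the Pod removes the container without starting a fresh copy, which is closer to deleting the workload than to redeploying it.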

Delete a workload

Deleting a workload, such as a Deployment or DaemonSet, causes all of its member Pods to be deleted. All containers inside those Pods stop running. Deleting a workload can make sense when:

  • You want to stop an attack in progress.
  • You are willing to take the workload offline.
  • Stopping the attack immediately is more important than application uptime or forensic analysis.

To delete a workload, use kubectl delete [CONTROLLER_TYPE]. For example, to delete a Deployment, run the following command:

kubectl delete deployments [DEPLOYMENT]

If deleting the workload doesn't delete all associated Pods or containers, you can manually delete the containers using the container runtime's CLI tool, typically docker. If your nodes run containerd, use crictl.

Docker

To kill a container using the Docker container runtime, you can use either docker stop or docker kill.

docker stop stops the container by sending a SIGTERM signal to the root process, and waits 10 seconds for the process to exit by default. If the process hasn't exited in that time period, it then sends a SIGKILL signal. You can specify this grace period with the --time option.

docker stop --time [TIME_IN_SECONDS] [CONTAINER] 

docker kill is the fastest method to kill a container. It sends the SIGKILL signal immediately.

docker kill [CONTAINER]

You can also kill and remove a container in one command with docker rm -f.

docker rm -f [CONTAINER]

containerd

If you use the containerd runtime in GKE, you kill or remove containers with crictl.

To stop a container in containerd, run the following command:

crictl stop [CONTAINER]

To remove a container in containerd, run the following command:

crictl rm -f [CONTAINER]
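
The crictl commands above take a container ID. As a minimal sketch, the following function finds the running containers of one Pod and stops each of them. The function name is hypothetical, and it assumes you run it on the node itself with enough privileges to use crictl.

```shell
# Stop every running container that belongs to one Pod on this node.
# $1 = Pod name. crictl treats the name as a pattern, so check that it does
# not also match other Pods on the node.
stop_pod_containers() {
  for sandbox in $(crictl pods --name "$1" -q); do
    for container in $(crictl ps --pod "$sandbox" -q); do
      crictl stop "$container"
    done
  done
}
```

Run crictl pods --name [POD_NAME] on its own first to review which Pods match before you stop anything.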

Deleting the host VM

If you are unable to delete or remove the container, you can delete the virtual machine that hosts the affected container.

If the Pod is still visible, you can find the name of the host VM with the following command:

kubectl get pods --all-namespaces \
  -o=custom-columns=POD_NAME:.metadata.name,INSTANCE_NAME:.spec.nodeName \
  --field-selector=metadata.name=[POD_NAME]

To delete the host VM, run the following gcloud command:

gcloud compute instance-groups managed delete-instances [INSTANCE_GROUP_NAME] --instances=[NODE_NAME]

Deleting the instance, like abandoning it, reduces the size of the Managed Instance Group by one VM. You can manually add one instance back to the group with the following command:

gcloud compute instance-groups managed resize [INSTANCE_GROUP_NAME] --size=[SIZE]
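
If you don't know the current size of the group, you can read it first and add one. This is a minimal sketch that assumes a zonal Managed Instance Group. The function name grow_group_by_one is hypothetical.

```shell
# Add one instance back to a zonal Managed Instance Group.
# $1 = instance group name, $2 = zone
grow_group_by_one() {
  # Read the group's current target size; stop if the lookup fails.
  current=$(gcloud compute instance-groups managed describe "$1" --zone "$2" \
      --format="value(targetSize)") || return 1
  gcloud compute instance-groups managed resize "$1" --zone "$2" \
      --size="$((current + 1))"
}
```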

Next steps

For more information, see: