This article details the instrumentation and tools used in forensic analysis for apps deployed to Google Kubernetes Engine (GKE).
Building and maintaining a secure environment for your code, apps, and infrastructure is a high priority for any organization. If a security incident occurs, knowing how to respond to it and investigate it is also crucial.
Google Cloud simplifies securing GKE apps by providing advanced security-related features to help secure your cluster and Google Cloud environment. Cloud Audit Logs provide detailed logging that you can use in forensic analysis. Together, advanced security-related features and logging provide a robust platform to help protect your organization's GKE apps.
The goal of this article is to help secure your Google Cloud infrastructure and container-based code in GKE and also help prepare you for a security incident.
Securing your environment and preparing for forensic analysis with Google Cloud requires several critical steps:
- Protecting your Google Cloud environment. Configure Google Cloud and deploy workloads using the appropriate security-related controls and configuration.
- Preparing an incident response plan. Plan how to respond to a security incident.
- Collecting all relevant logs and data sources. Collect logs and appropriate data about your Google Cloud environment in advance and know how to access them.
- Using automated event detection. Configure proactive scanning to alert you to potential security events, misconfigurations, and vulnerabilities.
- Using analytical tooling for forensic analysis. Use analysis tooling to help uncover and document a security incident.
Protecting your Google Cloud environment
Google Cloud provides a range of configurations and tools that you can use to help secure your Google Cloud organization and projects.
- At the infrastructure level, Cloud Identity and Access Management (Cloud IAM) policies, firewall rules, GKE Workload Identity, and GKE PodSecurityPolicies can help you build a security-enhanced, least-privilege Google Cloud environment.
- At the GKE cluster level, you can use container-focused tools like Binary Authorization and Container Analysis in a layered security approach.
- At the network layer, features such as VPCs, load balancers, and Google Cloud Armor provide security-related controls on network traffic.
- At the app level, tools such as Cloud Endpoints and Identity-Aware Proxy (IAP) can enhance the security of GKE apps.
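As one way to check the least-privilege goal at the infrastructure level, the following sketch flags bindings that grant the broad basic roles (Owner and Editor) in an IAM policy. It assumes the policy is available as JSON in the `bindings` shape that `gcloud projects get-iam-policy PROJECT_ID --format=json` prints. The function name, the set of roles treated as too broad, and the sample policy are illustrative assumptions.

```python
# Basic roles are broad and rarely fit a least-privilege model.
# Which roles count as "too broad" is an assumption; adjust to your policy.
BROAD_ROLES = {"roles/owner", "roles/editor"}


def find_broad_bindings(policy):
    """Return sorted (role, member) pairs that grant a broad basic role."""
    findings = []
    for binding in policy.get("bindings", []):
        role = binding.get("role")
        if role in BROAD_ROLES:
            for member in binding.get("members", []):
                findings.append((role, member))
    return sorted(findings)


# Hypothetical policy, shaped like the JSON output of get-iam-policy.
SAMPLE_POLICY = {
    "bindings": [
        {"role": "roles/editor",
         "members": ["serviceAccount:ci@example-project.iam.gserviceaccount.com"]},
        {"role": "roles/container.viewer",
         "members": ["user:analyst@example.com"]},
        {"role": "roles/owner",
         "members": ["user:admin@example.com"]},
    ]
}

for role, member in find_broad_bindings(SAMPLE_POLICY):
    print(f"{member} holds {role}")
```

A check like this could run on a schedule so that new broad grants are noticed before an incident rather than during one.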
Shared responsibility model
Security in the cloud is a shared responsibility between the cloud provider and the customer. Google Cloud helps protect the underlying infrastructure by providing encryption at rest by default, and by providing capabilities that you can use to help protect your workloads, such as access controls in Cloud IAM and Cloud Audit Logs. For GKE, Google Cloud helps secure the control plane, and customers are responsible for protecting their workloads. For an in-depth discussion of the shared security model, see the blog post Exploring container security: the shared responsibility model in GKE.
Kubernetes and GKE provide several mechanisms to help secure your cluster and its workloads. Building an environment with least-privilege access and appropriate security-related controls reduces the attack surface. Several guides provide detailed information on how to help secure your cluster and its workloads:
- GKE security overview describes the security mechanisms available in GKE to help protect your cluster and workloads.
- Hardening your cluster's security is a prescriptive guide containing our current recommendations for hardening your GKE cluster.
- The container security blog series describes ways to build and run security-enhanced workloads.
Preparing an incident response plan
Effective incident response is crucial for managing and recovering from incidents, as well as for preventing future incidents. For ideas about how to build or improve your incident response plan, see Google Cloud's incident response plan.
The following phases provide a high-level view of the process:
- Identification. Early and accurate identification of incidents is key to strong and effective incident management. The focus of this phase is to monitor security events to detect and report on potential data incidents.
- Coordination. When an incident is reported, the on-call responder reviews and evaluates the nature of the incident report to determine if it represents a potential data incident, and initiates an incident response process.
- Resolution. At this stage, the focus is investigating the root cause, limiting the impact of the incident, resolving any immediate security risks, implementing necessary fixes as part of remediation, and recovering affected systems, data, and services.
- Continuous improvement. With each new incident, you gain new insights that can help you enhance your tools, training, and processes.
Incident Response and Management (IRM)
IRM (alpha) provides end-to-end incident management to help you reduce the time to mitigate incidents. IRM lets you manage the full lifecycles of your incidents and provides holistic data and analytics. IRM also builds in best practices for incident response such as efficient communication, collaboration, and contextual awareness.
Stick to the plan and know when to call in the experts
Before you launch any code to a production cluster, it's critical to understand the security model for your app and infrastructure (including Google Cloud) and build an incident response plan. In your plan, be sure to include an escalation path that describes which teams to involve at what phase of the response.
For example, an incident response could start with the operations team submitting a potential incident that is later classified as a security incident. The incident is then assigned to the appropriate security team members. The incident response plan defines when to call in external security professionals and how to engage them. Developing a process like this is a critical part of preparing an effective incident response.
Collecting all relevant logs and data sources
After you implement necessary security protection mechanisms for your Kubernetes environment and draft an incident response plan, you should ensure that you can access all the necessary information for forensic analysis. You should begin by collecting logs as soon as you deploy an app or set up a Google Cloud project. Capturing the logs helps to ensure they're available if you need them for analysis. In general, some data sources are unique to Google Cloud environments; others are common to containers. Both sources are critical to retain and make available for analysis.
Logs provide a rich data set to help identify specific security events. Each of the following log sources might provide details that you can use in your analysis.
Cloud Audit Logs
Google Cloud services write audit logs called Cloud Audit Logs. These logs help you answer the questions, "Who did what, where, and when?" There are three types of audit logs for each project, folder, and organization: Admin Activity, Data Access, and System Event. These logs collectively help you understand what administrative API calls were made, what data was accessed, and what system events occurred. This information is critical for any analysis. For a list of Google Cloud services that provide audit logs, see Google services with audit logs.
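To illustrate how the three audit log types are addressed, the following sketch builds a Cloud Logging filter for one log type and a time window. The helper name and its parameters are assumptions made for this example. The log IDs follow the `cloudaudit.googleapis.com` naming that Cloud Audit Logs uses.

```python
# Log IDs used by Cloud Audit Logs; "%2F" is the URL-encoded "/".
AUDIT_LOG_IDS = {
    "admin_activity": "cloudaudit.googleapis.com%2Factivity",
    "data_access": "cloudaudit.googleapis.com%2Fdata_access",
    "system_event": "cloudaudit.googleapis.com%2Fsystem_event",
}


def audit_log_filter(project_id, log_type, start_time, end_time):
    """Build a Cloud Logging filter for one audit log type and time window.

    start_time and end_time are RFC 3339 strings, for example
    "2020-01-01T00:00:00Z". Raises KeyError for an unknown log_type.
    """
    log_id = AUDIT_LOG_IDS[log_type]
    return (
        f'logName="projects/{project_id}/logs/{log_id}" '
        f'AND timestamp>="{start_time}" '
        f'AND timestamp<="{end_time}"'
    )


# "example-project" is a placeholder project ID.
print(audit_log_filter(
    "example-project", "admin_activity",
    "2020-01-01T00:00:00Z", "2020-01-02T00:00:00Z",
))
```

The resulting string can be pasted into the Logs Viewer or passed as the filter argument to `gcloud logging read`.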
Cloud Logging collects your container standard output and error logs. You can also add other logs. For clusters with Istio and Cloud Logging enabled, the Istio-specific logs are also collected, reported, and sent to Cloud Logging.
Infrastructure logs offer insight into the activities and events at the OS, cluster, and networking levels.
GKE audit logs
GKE sends two types of audit logs: GKE audit logs and Kubernetes Audit Logging. Kubernetes writes audit logs to Cloud Audit Logs for calls made to the Kubernetes API server. Kubernetes audit log entries are useful for investigating suspicious API requests, for collecting statistics, and for creating monitoring alerts for unwanted API calls. In addition, GKE writes its own audit logs that identify what occurs in a GKE cluster.
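As a sketch of how Kubernetes audit log entries might be used to investigate suspicious API requests, the following example counts calls per principal for a few sensitive method names. It assumes the entries were exported as JSON and that each one carries `protoPayload.methodName` and `protoPayload.authenticationInfo.principalEmail`, as Cloud Audit Logs entries do. The list of suspicious fragments and the sample entries are illustrative assumptions, not a rule set from this article.

```python
from collections import Counter

# Method-name fragments worth a closer look during an investigation.
# This list is an illustrative assumption, not an exhaustive rule set.
SUSPICIOUS_FRAGMENTS = ("pods.exec", "secrets.", "clusterrolebindings.")


def summarize_suspicious_calls(entries):
    """Count (principal, method) pairs whose method matches a fragment."""
    counts = Counter()
    for entry in entries:
        payload = entry.get("protoPayload", {})
        method = payload.get("methodName", "")
        principal = payload.get("authenticationInfo", {}).get(
            "principalEmail", "unknown")
        if any(fragment in method for fragment in SUSPICIOUS_FRAGMENTS):
            counts[(principal, method)] += 1
    return counts


def _entry(principal, method):
    """Build a minimal, hypothetical audit log entry for the demo."""
    return {"protoPayload": {
        "methodName": method,
        "authenticationInfo": {"principalEmail": principal},
    }}


SAMPLE_ENTRIES = [
    _entry("dev@example.com", "io.k8s.core.v1.pods.list"),
    _entry("dev@example.com", "io.k8s.core.v1.pods.exec.create"),
    _entry("dev@example.com", "io.k8s.core.v1.pods.exec.create"),
    _entry("sa@example.com", "io.k8s.core.v1.secrets.list"),
]

for (principal, method), n in summarize_suspicious_calls(SAMPLE_ENTRIES).most_common():
    print(f"{n:3d}  {principal}  {method}")
```

The same counts could feed a monitoring alert, for example when a principal that normally never uses `exec` starts doing so.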
Compute Engine Cloud Audit Logs for GKE nodes
GKE runs on top of Compute Engine nodes, which generate their own audit logs. In addition, you can configure auditd to capture Linux system logs. auditd provides valuable information such as error messages, login attempts, and binary executions for your cluster nodes. Both the Compute Engine audit logs and the auditd audit logs provide insight into activities that happen at the underlying cluster infrastructure level.
For container and system logs, GKE deploys a per-node logging agent that reads container logs, adds helpful metadata, and then stores the logs. The logging agent checks for container logs in the following sources:
- Standard output and standard error logs from containerized processes
- kubelet and container runtime logs
- Logs for system components, such as VM startup scripts
For events, GKE uses a deployment that automatically collects events and sends them to Cloud Logging.
Logs are collected for clusters, nodes, pods, and containers.
Istio on Google Kubernetes Engine
Auditd for Container-Optimized OS on GKE
For Linux systems, the auditd daemon provides access to OS system-level commands and can provide valuable insight into the events inside your containers. On GKE, you can collect auditd logs and send them to Cloud Logging.
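To show what an auditd record contains, the following sketch parses one raw record in the standard Linux audit `key=value` format into a dictionary. The sample record is hypothetical. The exact shape in which auditd logs arrive in Cloud Logging is not covered here, so treat this as a parser for the raw line only.

```python
import re
import shlex

# Matches the "audit(<epoch seconds>:<serial>)" part of the msg field.
_MSG_RE = re.compile(r"audit\((\d+\.\d+):(\d+)\)")


def parse_audit_record(line):
    """Parse a raw auditd record into a dict of its key=value fields.

    Adds "timestamp" (float) and "serial" (int) when the msg field is
    present. shlex.split raises ValueError on unbalanced quotes.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    match = _MSG_RE.search(fields.get("msg", ""))
    if match:
        fields["timestamp"] = float(match.group(1))
        fields["serial"] = int(match.group(2))
    return fields


# Hypothetical EXECVE record: a process ran `cat /etc/passwd`.
SAMPLE_RECORD = (
    'type=EXECVE msg=audit(1364481363.243:24287): '
    'argc=2 a0="cat" a1="/etc/passwd"'
)

record = parse_audit_record(SAMPLE_RECORD)
print(record["type"], record["a0"], record["a1"])
```

Records that share a serial number belong to the same event, so grouping parsed records by `serial` is a natural next step. Some arguments are hex-encoded by auditd, which this sketch does not decode.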
VPC Flow Logs
VPC Flow Logs records a sample of network flows sent from and received by VM instances. This information is useful for analyzing network communication. When the Intranode Visibility feature is enabled in your Kubernetes cluster, VPC Flow Logs includes all pod-to-pod traffic.
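As a sketch of analyzing network communication from flow logs, the following example totals the bytes sent per connection and lists the heaviest connections first. It assumes the flow log entries were exported as JSON with a `jsonPayload.connection` object and a `jsonPayload.bytes_sent` field. The sample entries and the function name are hypothetical.

```python
from collections import defaultdict


def bytes_by_connection(entries):
    """Sum bytes_sent per (src_ip, dest_ip, dest_port), largest first."""
    totals = defaultdict(int)
    for entry in entries:
        payload = entry.get("jsonPayload", {})
        conn = payload.get("connection", {})
        key = (conn.get("src_ip"), conn.get("dest_ip"), conn.get("dest_port"))
        # 64-bit integers are typically serialized as strings in JSON exports.
        totals[key] += int(payload.get("bytes_sent", 0))
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def _flow(src_ip, dest_ip, dest_port, bytes_sent):
    """Build a minimal, hypothetical flow log entry for the demo."""
    return {"jsonPayload": {
        "connection": {"src_ip": src_ip, "dest_ip": dest_ip,
                       "dest_port": dest_port},
        "bytes_sent": str(bytes_sent),
    }}


SAMPLE_FLOWS = [
    _flow("10.0.0.5", "10.0.1.9", 443, 1200),
    _flow("10.0.0.7", "203.0.113.10", 22, 500),
    _flow("10.0.0.5", "10.0.1.9", 443, 800),
]

for (src, dest, port), total in bytes_by_connection(SAMPLE_FLOWS):
    print(f"{src} -> {dest}:{port}  {total} bytes")
```

Because flow logs are sampled, the totals are lower bounds and are best used for comparing connections rather than for exact accounting.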
Other Google Cloud services
Google Cloud services generate Cloud Audit Logs, and some services (such as Cloud Load Balancing) generate additional logs. After a service is enabled, Cloud Logging begins generating logs. For Cloud Audit Logs, only Admin Activity logs are enabled by default. You can enable audit logs for the other services when those services are enabled.
Storage snapshots
Snapshots can be useful for analyzing the contents of storage at a given point. You can take snapshots of the storage attached to a Google Kubernetes Engine cluster node and schedule them at regular intervals. Because snapshots operate at the cluster node level and the cluster nodes might change over time, you should automate taking snapshots whenever a node with new storage is created. Having a regular snapshot of your node's storage helps ensure that you have storage data available for analysis.
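One small piece of that automation can be sketched as a helper that builds the `gcloud compute disks snapshot` command for a node's disk, with a timestamped snapshot name. The naming scheme, the function name, and the disk and zone values are assumptions for illustration. How the helper gets triggered when a new node appears is left out.

```python
from datetime import datetime, timezone

# Resource names are limited to 63 characters.
_MAX_NAME_LEN = 63


def snapshot_command(disk_name, zone, now=None):
    """Return the gcloud argv that snapshots one disk with a dated name."""
    now = now or datetime.now(timezone.utc)
    suffix = now.strftime("-%Y%m%d-%H%M%S")
    # Truncate long disk names so the full snapshot name stays within limits.
    prefix = disk_name[:_MAX_NAME_LEN - len(suffix)]
    return [
        "gcloud", "compute", "disks", "snapshot", disk_name,
        f"--zone={zone}",
        f"--snapshot-names={prefix}{suffix}",
    ]


# Hypothetical disk name and zone.
print(" ".join(snapshot_command("gke-node-1", "us-central1-a")))
```

The returned list can be passed to `subprocess.run` from a scheduled job. Returning the command instead of running it keeps the helper easy to review and test.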
Using automated event detection
Automation, along with alerting, is key to monitoring any environment at scale. You can use both Google Cloud and Kubernetes tooling to identify potential threats and misconfigurations.
Security Command Center
Security Command Center is a security and risk platform for Google Cloud. Security Command Center can make it easier for you to prevent, detect, and respond to threats by gathering data, identifying threats, and recommending actions. Many security scanning tools available on Google Cloud report results to Security Command Center, making this platform useful for automated detection. Security Command Center is an important tool for your SecOps or DevSecOps teams.
Event Threat Detection
Event Threat Detection is built to automatically detect the most serious threats facing organizations and publish results to Security Command Center. Examples of threats include the addition of potentially malicious users and service accounts, compromised Compute Engine instances, and malicious network traffic. Event Threat Detection is powered by Google's open source and closed-source threat intelligence and works by detecting threats in logs available in Cloud Logging.
Security Health Analytics
Security Health Analytics is a Google Cloud service that automatically scans for common vulnerabilities and misconfigurations across Google Cloud offerings. Security Health Analytics can detect vulnerabilities related to containers. Vulnerabilities include logging or monitoring being disabled, and enabling the Kubernetes management web UI dashboard. Findings are written to Security Command Center.
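As a sketch of triaging such findings, the following example keeps only the active findings in a few container-related categories. It assumes the findings were exported from Security Command Center as JSON objects with `category`, `state`, and `resourceName` fields. The category names and sample findings are assumptions for illustration; check them against the categories that your scanner actually reports.

```python
# Container-related categories to triage first. These names are assumed
# for illustration; confirm them against your own findings.
CONTAINER_CATEGORIES = {
    "CLUSTER_LOGGING_DISABLED",
    "CLUSTER_MONITORING_DISABLED",
    "WEB_UI_ENABLED",
}


def active_container_findings(findings):
    """Return findings that are ACTIVE and in a container category."""
    return [
        finding for finding in findings
        if finding.get("state") == "ACTIVE"
        and finding.get("category") in CONTAINER_CATEGORIES
    ]


# Hypothetical findings; resource names are placeholders.
SAMPLE_FINDINGS = [
    {"category": "WEB_UI_ENABLED", "state": "ACTIVE",
     "resourceName": "cluster-a"},
    {"category": "CLUSTER_LOGGING_DISABLED", "state": "INACTIVE",
     "resourceName": "cluster-b"},
    {"category": "OPEN_FIREWALL", "state": "ACTIVE",
     "resourceName": "firewall-c"},
]

for finding in active_container_findings(SAMPLE_FINDINGS):
    print(finding["resourceName"], finding["category"])
```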
Forseti
Forseti is an open source project that can build an inventory of your Google Cloud resources, scan the environment, and set policies to enforce. Forseti is integrated with and can report findings to Security Command Center. You can use Forseti to check arbitrary configuration values on your GKE clusters to ensure that they're consistent with your specifications. If you're using Anthos, you can use Anthos Config Management to define common configurations, enforce those configurations, and monitor against configuration drift.
kube-hunter
kube-hunter scans for security weaknesses in Kubernetes clusters by using remote, internal, and network scanning. kube-hunter can be used either in interactive mode or as automated remote penetration testing for your cluster.
Using analytical tooling for forensic analysis of containers
Many tools in the Google Cloud and Kubernetes ecosystem are useful for forensic analysis. This section includes three for reference.
Google Cloud log analysis using BigQuery
Cloud Audit Logs can be exported to Cloud Storage, Pub/Sub, or BigQuery for further analysis. For example, you might want to look at all the firewall rule changes in your Google Cloud projects over a given period. To do that, you can export Cloud Audit Logs to BigQuery. Then, using BigQuery, you can construct a SQL query to return that information.
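The firewall example can be sketched as a helper that builds the SQL text for such a query. It assumes that Admin Activity logs were exported to date-sharded tables named `cloudaudit_googleapis_com_activity_*`, that the audit payload is in the `protopayload_auditlog` column, and that firewall rule changes carry the resource type `gce_firewall_rule`. The project ID, dataset name, and function name are placeholders. `@start_time` and `@end_time` are BigQuery query parameters that you supply when you run the query.

```python
import re

# Identifiers are interpolated into the SQL text, so restrict them to
# characters that are safe inside a backtick-quoted table path.
_IDENTIFIER_RE = re.compile(r"^[A-Za-z0-9_-]+$")


def firewall_changes_query(project_id, dataset):
    """Build a BigQuery SQL query listing firewall rule changes."""
    for name in (project_id, dataset):
        if not _IDENTIFIER_RE.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    table = f"`{project_id}.{dataset}.cloudaudit_googleapis_com_activity_*`"
    return f"""
SELECT
  timestamp,
  protopayload_auditlog.authenticationInfo.principalEmail AS principal,
  protopayload_auditlog.methodName AS method,
  protopayload_auditlog.resourceName AS target_resource
FROM {table}
WHERE resource.type = "gce_firewall_rule"
  AND timestamp BETWEEN @start_time AND @end_time
ORDER BY timestamp
""".strip()


# "example-project" and "audit_logs" are placeholder names.
print(firewall_changes_query("example-project", "audit_logs"))
```

Passing the time window as query parameters rather than formatting it into the string keeps the query text reusable across investigations.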
Docker-explorer
Docker-explorer is a project to help a forensics analyst explore offline Docker file systems. This approach might be useful when a Docker container is compromised.
Kubectl Sysdig Capture + Sysdig Inspect
Kubectl Sysdig Capture is an open source kubectl plugin that triggers a capture of system activity in a pod. Sysdig makes pod information available for Sysdig Inspect, an open source tool used for container troubleshooting and security investigations. Sysdig Inspect visually organizes the granular system, network, and app activity of a Linux system and correlates activities inside a pod.