Managing GKE logs


This page provides an overview of the logging options available in Google Kubernetes Engine (GKE).

Overview

When you enable the Cloud Operations for GKE integration with Cloud Logging and Cloud Monitoring for your cluster, your logs are stored in a dedicated, persistent datastore. While GKE itself stores logs, these logs are not stored permanently. For example, GKE container logs are removed when their host Pod is removed, when the disk on which they are stored runs out of space, or when they are replaced by newer logs. System logs are periodically removed to free up space for new logs. Cluster events are removed after one hour.

For container and system logs, GKE deploys, by default, a per-node logging agent that reads container logs, adds helpful metadata, and then stores them in Cloud Logging. The logging agent checks for container logs in the following sources:

  • Standard output and standard error logs from containerized processes

  • kubelet and container runtime logs

  • Logs for system components, such as VM startup scripts

For events, GKE uses a deployment in the kube-system namespace which automatically collects events and sends them to Logging.
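
To confirm that these logging components are running in your cluster, you can list the workloads in the kube-system namespace. This is a sketch; on recent GKE versions the per-node agent is typically a DaemonSet named fluentbit-gke and the event collector a Deployment named event-exporter-gke, but the exact names can vary by version.

kubectl get daemonsets,deployments --namespace=kube-system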

What logs are collected

By default, GKE collects several types of logs from your cluster and stores those in Cloud Logging:

  • Audit logs - These include the Admin Activity log, Data Access log, and the Events log. For detailed information about the Audit Logs for GKE, refer to the Audit Logs for GKE documentation. Audit logs for GKE cannot be disabled.

  • System logs – These include logs from:

    • All Pods running in namespaces kube-system, istio-system, knative-serving, gke-system, and config-management-system.

    • Key services that are not containerized including docker/containerd runtime, kubelet, kubelet-monitor, node-problem-detector, and kube-container-runtime-monitor.

    • The node's serial ports output, if the VM instance metadata serial-port-logging-enable is set to true. As of GKE 1.16.13-gke.400, serial port output for nodes is collected by the Logging agent. To disable serial port output logging, set --metadata serial-port-logging-enable=false during cluster creation (see the example after this list). Serial port output is useful for troubleshooting crashes, failed boots, startup issues, or shutdown issues with GKE nodes. Disabling these logs might limit Google's ability to troubleshoot such issues.

  • Application logs – All logs generated by non-system containers running on user nodes.
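
For example, the following sketch creates a cluster with serial port output logging disabled; replace the placeholders with your own cluster name and region:

gcloud container clusters create [CLUSTER_NAME] \
  --region=[REGION] \
  --metadata=serial-port-logging-enable=false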

Optionally, GKE can collect additional types of logs from certain Kubernetes control plane components and store them in Cloud Logging:

  • API server logs - All logs generated by the Kubernetes API server (kube-apiserver).

  • Scheduler logs - All logs generated by the Kubernetes Scheduler (kube-scheduler).

  • Controller Manager logs - All logs generated by the Kubernetes Controller Manager (kube-controller-manager).

To learn more about each of these control plane components, see the cluster architecture for Standard and for Autopilot.

Collecting your logs

When you create a new GKE cluster, Cloud Operations for GKE integration with Cloud Logging and Cloud Monitoring is enabled by default.

System and application logs are delivered to the Logs Router in Cloud Logging.

From there, logs may be ingested into Cloud Logging, excluded, or exported to BigQuery, Pub/Sub, or Cloud Storage.
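
For example, the following sketch creates a sink that routes GKE container logs to a BigQuery dataset; the dataset is assumed to already exist, and the placeholders are illustrative:

gcloud logging sinks create [SINK_NAME] \
  bigquery.googleapis.com/projects/[PROJECT_ID]/datasets/[DATASET_ID] \
  --log-filter='resource.type="k8s_container"'

After the sink is created, grant the sink's writer identity (printed by the command) write access to the destination dataset.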

Default logging throughput

When system logging is enabled, a dedicated, fluentbit-based Logging agent is automatically deployed and managed. It runs on all GKE nodes in a cluster to collect logs, adds helpful metadata about the container, Pod, and cluster, and then sends the logs to Cloud Logging.

The dedicated Logging agent guarantees at least 100 KB per second log throughput per node for system and workload logs. If a node is underutilized, then depending on the type of log load (for example, text or structured log entries, very few containers on the node or many containers), the dedicated logging agent might provide throughput as much as 500 KB per second or more. Additionally, in clusters with GKE control plane version 1.24 or later, the Logging agent allows for throughput as high as 10 MB per second on nodes that have at least 2 unused CPU cores. Be aware, however, that at higher throughputs, some logs may be lost.

Identify nodes with higher logging throughput

If your GKE cluster has system metrics enabled, then the kubernetes.io/node/logs/input_bytes metric provides the number of log bytes generated per second on a node. This metric can help you decide which variant of the logging agent makes sense to deploy in your cluster or node pools.

To view the historical logging throughput for each node in your cluster, follow these steps:

  1. In the Google Cloud console, select Monitoring:

    Go to Monitoring

  2. In the Monitoring navigation pane, click Metrics Explorer.

  3. In the Select a metric field, select kubernetes.io/node/logs/input_bytes.

  4. In the Group by field, select project_id, location, cluster_name, and node_name.

  5. Click OK.

  6. Optionally, sort the list of metrics in descending order by clicking the column header Value above the list of metrics.

To understand how much logging volume is due to system components or due to workloads running on the node, you may also group by the type metric label.

Increasing logging agent throughput

If any GKE nodes require more than 100 KB per second log throughput and your GKE Standard cluster is using control plane version 1.24.2-gke.300 or later, you may request that GKE deploy an alternative configuration of the Logging agent designed to maximize logging throughput. This maximum throughput Logging variant allows for throughput as high as 10 MB per second per node. You can deploy this high-throughput Logging agent to all nodes in a cluster or to all nodes in a node pool.

This high-throughput configuration will consume additional CPU and memory.

High throughput for all nodes in a cluster

To deploy the high-throughput Logging agent to all nodes in a cluster whose control plane version is 1.24.2-gke.300 or later, pass --logging-variant=MAX_THROUGHPUT to the gcloud container clusters create or gcloud container clusters update commands:

gcloud container clusters create [CLUSTER_NAME] \
  --region=[REGION] \
  --logging-variant=MAX_THROUGHPUT

gcloud container clusters update [CLUSTER_NAME] \
  --region=[REGION] \
  --logging-variant=MAX_THROUGHPUT

High throughput for all nodes in a node pool

To deploy the high-throughput Logging agent to all nodes in a node pool whose cluster control plane version is 1.24.2-gke.300 or later, pass --logging-variant=MAX_THROUGHPUT to the gcloud container node-pools create or gcloud container node-pools update commands:

gcloud container node-pools create [NODE_POOL_NAME] \
  --cluster=[CLUSTER_NAME] \
  --logging-variant=MAX_THROUGHPUT

gcloud container node-pools update [NODE_POOL_NAME] \
  --cluster=[CLUSTER_NAME] \
  --logging-variant=MAX_THROUGHPUT

Modifying the logging variant triggers a node pool restart. GKE marks the node pools for recreation, and they are recreated when it is safe to do so. This ensures that node pool maintenance policies are respected.


If you no longer want to use the maximum throughput logging agent, pass --logging-variant=DEFAULT to the gcloud container clusters create, gcloud container clusters update, gcloud container node-pools create, or gcloud container node-pools update commands.
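
For example, the following sketch reverts a single node pool to the default agent:

gcloud container node-pools update [NODE_POOL_NAME] \
  --cluster=[CLUSTER_NAME] \
  --logging-variant=DEFAULT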

Customizing log collection for system logs only

Beginning with GKE version 1.15.7, you can configure Cloud Operations for GKE to capture only system logs and not collect application logs.
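
On clusters that support the gcloud --logging flag, this is typically done by restricting log collection to the SYSTEM component. The following is a sketch; the placeholders are illustrative:

gcloud container clusters update [CLUSTER_NAME] \
  --region=[REGION] \
  --logging=SYSTEM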

Log collection with custom fluentd or fluentbit

GKE's default logging agent provides a managed solution to deploy and manage the agents that send the logs for your clusters to Cloud Logging. Depending on your GKE cluster control plane version, either fluentd or fluentbit is used to collect logs. Starting from GKE 1.17, logs are collected using a fluentbit-based agent. GKE clusters using versions prior to GKE 1.17 use a fluentd-based agent. If you want to alter the default behavior of these agents, then you can run a customized fluentd agent or a customized fluentbit agent.

Common use cases include:

  • removing sensitive data from your logs

  • collecting additional logs not written to STDOUT or STDERR

  • using specific performance-related settings

  • customized log formatting

Collecting Linux auditd logs for GKE nodes

You can enable verbose operating system audit logs on Google Kubernetes Engine nodes running Container-Optimized OS. Operating system logs on your nodes provide valuable information about the state of your cluster and workloads, such as error messages, login attempts, and binary executions. You can use this information to debug problems or investigate security incidents.

To learn more, go to Enabling Linux auditd logs on GKE nodes.

GKE Audit Logs

For detailed information about log entries that apply to the Kubernetes Cluster and GKE Cluster Operations resource types, go to Audit logging.

Logging Access Control

There are two aspects of logging access control: application access and user access. Cloud Logging provides IAM roles that you can use to grant appropriate access.

Application Access

Applications need permission to write logs, which is granted by assigning the IAM role roles/logging.logWriter to the service account for an application. When you create a GKE cluster, the roles/logging.logWriter role is enabled by default.
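
For example, the following sketch grants the role to the service account used by your nodes or workloads; the placeholders are illustrative:

gcloud projects add-iam-policy-binding [PROJECT_ID] \
  --member="serviceAccount:[SERVICE_ACCOUNT_EMAIL]" \
  --role="roles/logging.logWriter"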

User View Access

You need the roles/logging.viewer role to view logs in your project. If you need access to the Data Access logs, you need the roles/logging.privateLogViewer IAM role.

For more information about permissions and roles, go to the Access control guide. You can also review the Best practices for Cloud Audit Logs document, which also applies to Cloud Logging in general.

User Admin Access

The IAM roles roles/logging.configWriter and roles/logging.admin provide administrative capabilities. The roles/logging.configWriter IAM role is required to create a logging sink, which is commonly used to direct your logs to a specific or centralized project. For example, you might want to use a logging sink along with a logging filter to direct all of your logs for a namespace to a centralized logging bucket.
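
For example, the following sketch routes all container logs from one namespace to a log bucket in a centralized project; the bucket is assumed to already exist, and the placeholders are illustrative:

gcloud logging sinks create [SINK_NAME] \
  logging.googleapis.com/projects/[CENTRAL_PROJECT_ID]/locations/global/buckets/[BUCKET_ID] \
  --log-filter='resource.type="k8s_container" AND resource.labels.namespace_name="[NAMESPACE]"'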

To learn more, go to the Access Control guide for Cloud Logging.

Best practices

  • Structured logging: Single-line JSON strings written to standard output or standard error are read into Google Cloud's operations suite as structured log entries. See Structured logging for more details, and see the example after this list. You can use Advanced logs filters to filter logs based on their fields.
    • Logs generated with glog will have the common fields parsed, for example, severity, pid, source_file, source_line. However, the message payload itself is unparsed and shows up verbatim in the resulting log in Google Cloud's operations suite.
  • Severities: By default, logs written to the standard output are on the INFO level and logs written to the standard error are on the ERROR level. Structured logs can include a severity field, which defines the log's severity.
  • Exporting to BigQuery: For additional analysis, you can export logs to external services, such as BigQuery or Pub/Sub. Logs exported to BigQuery retain their format and structure. See Overview of logs exports for further information.
  • Alerting: When Logging logs unexpected behavior, you can use logs-based metrics to set up alerting policies. For an example, see Creating a simple alerting policy on a counter metric. For detailed information on logs-based metrics, see Overview of logs-based metrics.
  • Error reporting: To collect errors from applications running on your clusters, you can use Error Reporting.
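
As a minimal sketch of the structured logging behavior described above, a container whose process writes a single-line JSON string to standard output produces a structured log entry. The severity and message fields are recognized by Cloud Logging; component here is just an arbitrary example field that ends up in the entry's jsonPayload:

echo '{"severity":"WARNING","message":"retrying connection to backend","component":"checkout"}'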

Control plane logs

You can configure a GKE cluster to send logs emitted by the Kubernetes API server, Scheduler, and Controller Manager to Cloud Logging.

Requirements

Sending logs emitted by Kubernetes control plane components to Cloud Logging requires GKE control plane version 1.22.0 or later and requires that the collection of system logs be enabled.

Configuring collection of control plane logs

See the instructions to configure logging support for a new cluster or for an existing cluster.
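
As a sketch, on clusters that support the gcloud --logging flag, you can enable all three control plane components alongside system and workload logs on an existing cluster; the placeholders are illustrative:

gcloud container clusters update [CLUSTER_NAME] \
  --region=[REGION] \
  --logging=SYSTEM,WORKLOAD,API_SERVER,SCHEDULER,CONTROLLER_MANAGER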

Pricing

GKE control plane logs are exported to Cloud Logging. Standard Cloud Logging pricing applies.

Quota

Control plane logs consume the "Write requests per minute" quota of the Cloud Logging API. Before enabling control plane logs, check your recent peak usage of that quota. If you have many clusters in the same project or are already approaching that quota's limit, then you can request a quota-limit increase before enabling control plane logs.

Access controls

If you want to limit access within your organization to Kubernetes control plane logs, you can create a separate log bucket with more limited access controls.

When control plane logs are stored in a separate log bucket with limited access, they are not automatically accessible to anyone with roles/logging.viewer access to the project. Additionally, if you decide to delete certain control plane logs due to privacy or security concerns, storing them in a separate log bucket with limited access makes it possible to delete those logs without impacting logs from other components or services.
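
A sketch of that setup: create a log bucket with its own access controls, then route control plane logs to it with a sink. The log filter below is a placeholder; build the actual filter from the control plane log entries that appear in your project's Logs Explorer.

gcloud logging buckets create [BUCKET_ID] \
  --location=global \
  --description="Restricted bucket for GKE control plane logs"

gcloud logging sinks create [SINK_NAME] \
  logging.googleapis.com/projects/[PROJECT_ID]/locations/global/buckets/[BUCKET_ID] \
  --log-filter='[CONTROL_PLANE_LOG_FILTER]'

You can then grant access to that bucket's log views only to the users who need to read control plane logs.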