Configuring logging and monitoring

GKE on Bare Metal includes multiple options for cluster logging and monitoring, including cloud-based managed services, open source tools, and validated compatibility with third-party commercial solutions. This page explains these options and provides some basic guidance on selecting the proper solution for your environment.

Options for GKE on Bare Metal

You have several logging and monitoring options for your GKE on Bare Metal clusters:

  • Cloud Logging and Cloud Monitoring, enabled by default on Bare Metal system components.
  • Prometheus and Grafana, available from the Cloud Marketplace.
  • Validated configurations with third-party solutions.

Cloud Logging and Cloud Monitoring

Google Cloud Observability is the built-in observability solution for Google Cloud. It offers a fully managed logging solution, metrics collection, monitoring, dashboarding, and alerting. Cloud Monitoring monitors GKE on Bare Metal clusters in much the same way as it monitors cloud-based GKE clusters.

The agents can be configured with two different levels of logging and monitoring:

  • System components only (default).
  • System components and applications.

Logging and Monitoring provide an ideal solution if you want a single, easy-to-configure, powerful cloud-based observability solution. We highly recommend Logging and Monitoring when you run workloads only on GKE on Bare Metal, or on both GKE and GKE on Bare Metal. For applications with components running on both GKE on Bare Metal and traditional on-premises infrastructure, you might consider other solutions for an end-to-end view of those applications.

Prometheus and Grafana

Prometheus and Grafana are two popular open source monitoring products available in the Cloud Marketplace:

  • Prometheus collects application and system metrics.

  • Alertmanager sends alerts through several different alerting mechanisms.

  • Grafana is a dashboarding tool.

Prometheus and Grafana can be enabled on each admin cluster and user cluster. They are recommended for application teams with prior experience with those products, or for operations teams that prefer to retain application metrics within the cluster and to troubleshoot issues when network connectivity is lost.

Third-party solutions

Google has worked with several third-party logging and monitoring solution providers to help their products work well with GKE on Bare Metal. These include Datadog, Elastic, and Splunk. Additional validated third parties will be added in the future.

The following solution guides are available for using third-party solutions with GKE on Bare Metal:

How Logging and Monitoring for GKE on Bare Metal works

Cloud Logging and Cloud Monitoring are installed and activated in each cluster when you create a new admin or user cluster.

The Stackdriver agents include several components on each cluster:

  • Stackdriver Operator (stackdriver-operator-*). Manages the lifecycle for all other Stackdriver agents deployed onto the cluster.

  • Stackdriver Custom Resource. A resource that is automatically created as part of the GKE on Bare Metal installation process.

  • Stackdriver Log Aggregator (stackdriver-log-aggregator-*). A Fluentd StatefulSet that sends logs to the Cloud Logging API; if logs can't be sent, the Log Aggregator buffers the log entries, up to 200 GB, and tries to resend them for up to 24 hours. If the buffer gets full or if the Log Aggregator can't reach the Logging API for more than 24 hours, logs are dropped.

  • Stackdriver Log Forwarder (stackdriver-log-forwarder-*). A Fluent Bit DaemonSet that forwards logs from each machine to the Stackdriver Log Aggregator.

  • Stackdriver Metadata Collector (stackdriver-metadata-agent-*). A Deployment that sends metadata for Kubernetes resources such as pods, deployments, or nodes to the Stackdriver Resource Metadata API. This data enriches metric queries by letting you query by deployment name, node name, or even Kubernetes service name.
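The drop rule described for the Log Aggregator can be restated as a small shell check. This is only an illustrative sketch of the two limits quoted above (a 200 GB buffer and a 24-hour retry window). The function name and its arguments are hypothetical and are not part of any Stackdriver tooling.

```shell
# Hypothetical helper: restates the Log Aggregator limits described above.
# Arguments: outage duration in whole hours, buffered data in whole GB.
# Exit status 0 means logs would be dropped; non-zero means they are still buffered.
logs_would_drop() {
  outage_hours="$1"
  buffered_gb="$2"
  # Dropped if the API was unreachable for more than 24 hours,
  # or if the buffer has reached its 200 GB limit.
  [ "$outage_hours" -gt 24 ] || [ "$buffered_gb" -ge 200 ]
}
```

For example, logs_would_drop 30 50 succeeds because the outage exceeds 24 hours, while logs_would_drop 2 50 does not.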

You can see all of the agents installed by Stackdriver by running the following command:

  kubectl -n kube-system get pods | grep stackdriver

The output of this command is similar to the following:

stackdriver-log-aggregator-0                  1/1     Running   0   4h31m
stackdriver-log-aggregator-1                  1/1     Running   0   4h28m
stackdriver-log-forwarder-bpf8g               1/1     Running   0   4h31m
stackdriver-log-forwarder-cht4m               1/1     Running   0   4h31m
stackdriver-log-forwarder-fth5s               1/1     Running   0   4h31m
stackdriver-log-forwarder-kw4j2               1/1     Running   0   4h29m
stackdriver-metadata-agent-cluster-level...   1/1     Running   0   4h31m
stackdriver-operator-76ddb64d57-4tcj9         1/1     Running   0   4h37m
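To spot agents that are not healthy, the same pod listing can be filtered on its READY and STATUS columns. The following is a minimal sketch that assumes the default column layout shown above (NAME, READY, STATUS, RESTARTS, AGE). The function name is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints
# the name, status, and ready count of every stackdriver pod that is either
# not in the Running state or does not have all of its containers ready.
stackdriver_unhealthy() {
  awk '$1 ~ /^stackdriver-/ {
         split($2, ready, "/")
         if ($3 != "Running" || ready[1] != ready[2]) print $1, $3, $2
       }'
}
```

A possible invocation is shown next. If every agent is healthy, nothing is printed.

  kubectl -n kube-system get pods | stackdriver_unhealthy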

Cloud Monitoring metrics

For a list of metrics collected by Cloud Monitoring, see Google Distributed Cloud Virtual and Google Distributed Cloud Virtual for Bare Metal metrics.

Configuring Stackdriver agents for GKE on Bare Metal

The Stackdriver agents installed with GKE on Bare Metal collect data about system components, subject to your settings and configuration, to help maintain and troubleshoot your GKE on Bare Metal clusters. The agents run in one of the following modes.

System Components Only (Default Mode)

Upon installation, Stackdriver agents are configured by default to collect logs and metrics, including performance details (for example, CPU and memory utilization), and similar metadata, for Google-provided system components. These include all workloads in the admin cluster and, for user clusters, workloads in the kube-system, gke-system, gke-connect, istio-system, and config-management-system namespaces.

Stackdriver Disabled

Stackdriver agents can be disabled completely by deleting the Stackdriver custom resource. Caution: We do not recommend that you directly manage Stackdriver custom resources.

Before you disable Stackdriver, see the support page for details about how this affects Google Cloud Support's SLAs.

To disable Stackdriver for GKE on Bare Metal:

kubectl -n kube-system delete stackdrivers stackdriver
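Before running the delete command above, it can be useful to keep a copy of the custom resource for reference. The sketch below assumes that the resource can be read with the same resource type and name used by the delete command. The function name, the KUBECTL override, and the default file name are hypothetical choices made for this example.

```shell
# Hypothetical helper: saves the Stackdriver custom resource as YAML.
# Argument: output file (defaults to stackdriver-backup.yaml, an arbitrary name).
# KUBECTL may be set to another kubectl binary; it defaults to "kubectl".
backup_stackdriver_cr() {
  out_file="${1:-stackdriver-backup.yaml}"
  # Intentionally unquoted so that the default expands to a plain command name.
  ${KUBECTL:-kubectl} -n kube-system get stackdrivers stackdriver -o yaml > "$out_file"
}
```

The saved file is only a record of the configuration at that point in time; the caution above about directly managing Stackdriver custom resources still applies.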

Stackdriver agents capture data stored locally, subject to your storage and retention configuration. The data is replicated to the Google Cloud project specified at installation by using a service account that is authorized to write data to that project. Stackdriver agents can be disabled at any time, as described earlier. Data collected by Stackdriver agents can be managed and deleted like any other metric and log data, as described in the Cloud Monitoring documentation.

Configuration requirements for Logging and Monitoring

There are several configuration requirements to enable Cloud Logging and Cloud Monitoring with GKE on Bare Metal. These requirements are covered in the Configuring a service account for use with Logging and Monitoring section of the Enabling Google services page, and in the following list:

  1. A Cloud Monitoring Workspace must be created within the Google Cloud project. To create one, click Monitoring in the Google Cloud console and follow the workflow.
  2. You need to enable the following Stackdriver APIs:

  3. You need to assign the following IAM roles to the service account used by the Stackdriver agents:

    • logging.logWriter
    • monitoring.metricWriter
    • stackdriver.resourceMetadata.writer
    • monitoring.dashboardEditor
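The role assignments in step 3 can be scripted. The following sketch prints one gcloud projects add-iam-policy-binding command per role so that the commands can be reviewed before they are run. It assumes the roles are referenced with the usual roles/ prefix for predefined IAM roles. The function name is hypothetical, and the project ID and service account email are placeholders that you supply.

```shell
# Hypothetical helper: prints (does not run) the gcloud commands that grant
# the four roles listed above to a service account.
# Arguments: Google Cloud project ID, service account email address.
print_role_bindings() {
  project_id="$1"
  sa_email="$2"
  for role in logging.logWriter monitoring.metricWriter \
              stackdriver.resourceMetadata.writer monitoring.dashboardEditor; do
    printf 'gcloud projects add-iam-policy-binding %s --member="serviceAccount:%s" --role="roles/%s"\n' \
      "$project_id" "$sa_email" "$role"
  done
}
```

Printing the commands instead of executing them keeps the sketch free of side effects. After reviewing the output, it can be piped to a shell:

  print_role_bindings PROJECT_ID SERVICE_ACCOUNT_EMAIL | sh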

Pricing

There is no charge for Anthos system logs and metrics.

In a GKE on Bare Metal cluster, Anthos system logs and metrics include the following:

  • Logs and metrics from all components in an admin cluster
  • Logs and metrics from components in these namespaces in a user cluster: kube-system, gke-system, gke-connect, knative-serving, istio-system, monitoring-system, config-management-system, gatekeeper-system, cnrm-system
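The user cluster namespace list above can be encoded as a simple lookup, for example to label workloads when reviewing which logs and metrics count as system data. This is a sketch only: the function name is hypothetical, and it covers user clusters, because in an admin cluster all components are included.

```shell
# Hypothetical helper: exit status 0 if the namespace is in the user cluster
# list above, non-zero otherwise.
is_system_namespace() {
  case "$1" in
    kube-system|gke-system|gke-connect|knative-serving|istio-system) return 0 ;;
    monitoring-system|config-management-system|gatekeeper-system|cnrm-system) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, is_system_namespace gke-connect succeeds, while is_system_namespace default does not.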

For more information, see Pricing for Google Cloud Observability.

To learn about credit for Cloud Logging metrics, contact sales.