Version 1.6. This version is fully supported, offering the latest patches and updates for security vulnerabilities, exposures, and issues impacting Anthos clusters on VMware. Refer to the release notes for more details. This is the most recent version.

Logging and monitoring

Anthos clusters on VMware (GKE on-prem) includes multiple options for cluster logging and monitoring, including cloud-based managed services, open source tools, and validated compatibility with third-party commercial solutions. This page explains these options and provides some basic guidance on selecting the proper solution for your environment.

Options for Anthos clusters on VMware

You have several logging and monitoring options for your Anthos clusters on VMware:

  • Cloud Logging and Cloud Monitoring, enabled by in-cluster agents deployed with Anthos clusters on VMware.
  • Prometheus and Grafana, disabled by default.
  • Validated configurations with third-party solutions.

Cloud Logging and Cloud Monitoring

Google Cloud's operations suite is the built-in observability solution for Google Cloud. It offers fully managed logging, metrics collection, monitoring, dashboarding, and alerting. Cloud Monitoring monitors Anthos clusters on VMware in much the same way as it monitors cloud-based GKE clusters.

The in-cluster agents can be configured for the scope of monitoring and logging, as well as the level of metrics collected:

  • The scope of logging and monitoring can be set to system components only (the default) or to system components and applications.
  • The level of metrics collected can be set to an optimized set of metrics (the default) or to full metrics.

See Configuring Stackdriver agents for Anthos clusters on VMware on this page for more information.

Logging and Monitoring provide an ideal solution for customers wanting a single, easy-to-configure, powerful cloud-based observability solution. We highly recommend Logging and Monitoring when running workloads only on Anthos clusters on VMware, or workloads on GKE and Anthos clusters on VMware. For applications with components running on Anthos clusters on VMware and traditional on-premises infrastructure, you might consider other solutions for an end-to-end view of those applications.

Prometheus and Grafana

Prometheus and Grafana are two popular open source monitoring products:

  • Prometheus collects application and system metrics.

  • Alertmanager handles sending alerts out with several different alerting mechanisms.

  • Grafana is a dashboarding tool.

Prometheus and Grafana can be enabled on each admin cluster and user cluster. They are recommended for application teams with prior experience with those products, and for operational teams who prefer to retain application metrics within the cluster, for example to troubleshoot issues when network connectivity is lost.

Third-party solutions

Google has worked with several third-party logging and monitoring solution providers to help their products work well with Anthos clusters on VMware. These include Datadog, Elastic, and Splunk. Additional validated third parties will be added in the future.

The following solution guides are available for using third-party solutions with Anthos clusters on VMware:

How Logging and Monitoring for Anthos clusters on VMware works

Logging and metrics agents are installed and activated in each cluster when you create a new admin or user cluster.

The Stackdriver agents include several components on each cluster:

  • Stackdriver Operator (stackdriver-operator-*). Manages the lifecycle for all other Stackdriver agents deployed onto the cluster.

  • Stackdriver Custom Resource. A resource that is automatically created as part of the Anthos clusters on VMware installation process; users can change the custom resource to update values such as project ID, cluster name, and cluster location at any time.

  • Stackdriver Log Aggregator (stackdriver-log-aggregator-*). A Fluentd StatefulSet that sends logs to the Cloud Logging API; if logs can't be sent, the Log Aggregator buffers the log entries, up to 200 GB, and tries to resend them for up to 24 hours. If the buffer gets full or if the Log Aggregator can't reach the Logging API for more than 24 hours, logs are dropped.

  • Stackdriver Log Forwarder (stackdriver-log-forwarder-*). A Fluent Bit DaemonSet that forwards logs from each machine to the Stackdriver Log Aggregator.

  • Stackdriver Metrics Collector (stackdriver-prometheus-k8s-*). A Prometheus and Stackdriver Prometheus Sidecar StatefulSet that sends Prometheus metrics to the Cloud Monitoring API.

  • Stackdriver Metadata Collector (stackdriver-metadata-agent-*). A Deployment that sends metadata for Kubernetes resources such as Pods, Deployments, or nodes to the Stackdriver Resource Metadata API. This data enriches metric queries by letting you query by Deployment name, node name, or Kubernetes Service name.

You can list all of the installed Stackdriver agents by running the following command:

  kubectl -n kube-system get pods | grep stackdriver

The output of this command is similar to the following:

stackdriver-log-aggregator-0                  1/1     Running   0   4h31m
stackdriver-log-aggregator-1                  1/1     Running   0   4h28m
stackdriver-log-forwarder-bpf8g               1/1     Running   0   4h31m
stackdriver-log-forwarder-cht4m               1/1     Running   0   4h31m
stackdriver-log-forwarder-fth5s               1/1     Running   0   4h31m
stackdriver-log-forwarder-kw4j2               1/1     Running   0   4h29m
stackdriver-metadata-agent-cluster-level...   1/1     Running   0   4h31m
stackdriver-operator-76ddb64d57-4tcj9         1/1     Running   0   4h37m
stackdriver-prometheus-k8s-0                  2/2     Running   0   4h31m
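
A quick health check can build on the command above. The following is a minimal sketch, not something shipped with the product: the function name stackdriver_pods_not_running is hypothetical, and it assumes that kubectl already targets the intended cluster and that STATUS is the third column, as in the sample output.

```shell
# Print the name and status of any Stackdriver agent pod that is not "Running".
# Hypothetical helper; assumes kubectl already targets the intended cluster.
stackdriver_pods_not_running() {
  kubectl -n kube-system get pods --no-headers \
    | grep stackdriver \
    | awk '$3 != "Running" { print $1, $3 }'
}
```

Calling stackdriver_pods_not_running prints nothing when every agent pod reports Running.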

Configuring Stackdriver agents for Anthos clusters on VMware

The Stackdriver agents installed with Anthos clusters on VMware collect data about system components, subject to your settings and configuration, for the purposes of maintaining and troubleshooting issues with your clusters.

System components only (default scope)

Upon installation, Stackdriver agents collect logs and metrics, including performance details (for example, CPU and memory utilization) and similar metadata, for Google-provided system components. These include all workloads in the admin cluster, and for user clusters, workloads in the kube-system, gke-system, gke-connect, istio-system, and config-management-system namespaces. The Stackdriver agents can be configured or disabled as described in the following sections.

The scope of logs and metrics collected can be expanded to include applications, as well. For instructions to enable application logging and monitoring, see Enabling Logging and Monitoring for user applications.

Optimized metrics (default metrics)

By default, the metrics agents running in the cluster collect and report an optimized set of container and kubelet metrics to Google Cloud's operations suite (formerly Stackdriver). Fewer resources are needed to collect this optimized set of metrics, which improves overall performance and scalability. This is especially important for container-level metrics, due to the large quantity of objects to monitor.

To disable optimized metrics (not recommended), set the optimizedMetrics field to false in your Stackdriver custom resource. For more information on changing your Stackdriver custom resource, see Configuring Stackdriver component resources. All Anthos clusters on VMware metrics, including those excluded by default, are described in Anthos metrics.
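
As an illustration, the field can be changed with a merge patch instead of an interactive edit. This is a sketch under assumptions: the helper name set_optimized_metrics is hypothetical, and the field is assumed to sit under spec in the Stackdriver custom resource. Check the resource in your own cluster before relying on it.

```shell
# Set optimizedMetrics on the Stackdriver custom resource with a merge patch.
# Assumptions: the field lives under .spec, and kubectl targets the cluster.
set_optimized_metrics() {
  value="$1"   # "true" (optimized set, the default) or "false" (full metrics)
  case "$value" in
    true|false) ;;
    *) echo "usage: set_optimized_metrics true|false" >&2; return 1 ;;
  esac
  kubectl -n kube-system patch stackdriver stackdriver \
    --type merge -p "{\"spec\":{\"optimizedMetrics\":${value}}}"
}
```

For example, set_optimized_metrics false requests full metrics, and set_optimized_metrics true returns to the optimized set.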

Stackdriver disabled

Stackdriver agents can be disabled completely by deleting the Stackdriver custom resource. Before you disable Stackdriver, see the support page for details about how this affects Google Cloud Support's SLAs.

To disable Stackdriver for Anthos clusters on VMware:

kubectl -n kube-system delete stackdriver stackdriver
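
Before running the delete command, it may be useful to save a copy of the custom resource so that its values (project ID, cluster name, cluster location) are on hand later. The following is a minimal sketch: the function name and the default file name are illustrative, and the saved YAML includes server-populated fields that you would likely need to remove before re-applying it.

```shell
# Save the Stackdriver custom resource to a file before deleting it.
# The function name and default file name are illustrative only.
backup_stackdriver_cr() {
  out_file="${1:-stackdriver-backup.yaml}"
  kubectl -n kube-system get stackdriver stackdriver -o yaml > "$out_file"
}
```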

Stackdriver agents capture data stored locally, subject to your storage and retention configuration. The data is replicated to the Google Cloud project specified at installation by using a service account that is authorized to write data to that project. Stackdriver agents can be disabled at any time, as described earlier. Data collected by Stackdriver agents can be managed and deleted like any other metric and log data, as described in the Cloud Monitoring documentation.

Configuration requirements for Logging and Monitoring

To use Logging and Monitoring with a cluster, you must configure the Cloud project where you want to view logs and metrics. This Cloud project is called your logging-monitoring project.

  1. Create a Cloud Monitoring Workspace in your logging-monitoring project.

    For instructions on how to create a Cloud Monitoring Workspace, see Creating a Workspace.

  2. Enable the following APIs in your logging-monitoring project:

  3. Grant the following IAM roles to your logging-monitoring service account on your logging-monitoring project:

    • logging.logWriter
    • monitoring.metricWriter
    • stackdriver.resourceMetadata.writer
    • monitoring.dashboardEditor
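
The role grants in step 3 can be scripted with gcloud. The following is a sketch rather than an official procedure: the function name is hypothetical, the project ID and service account email are placeholders that you supply, and each role is written in its full roles/ form.

```shell
# Grant the four roles listed above to a service account on a project.
# The project ID and service account email are supplied by the caller.
grant_logging_monitoring_roles() {
  project_id="$1"
  sa_email="$2"
  for role in \
    roles/logging.logWriter \
    roles/monitoring.metricWriter \
    roles/stackdriver.resourceMetadata.writer \
    roles/monitoring.dashboardEditor
  do
    gcloud projects add-iam-policy-binding "$project_id" \
      --member "serviceAccount:${sa_email}" \
      --role "$role" || return 1
  done
}
```

A call would look like grant_logging_monitoring_roles PROJECT_ID SERVICE_ACCOUNT_EMAIL. The loop stops at the first binding that fails.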

Pricing

There is no charge for Anthos system logs and metrics.

In an Anthos clusters on VMware cluster, Anthos system logs and metrics include the following:

  • Logs and metrics from all components in an admin cluster
  • Logs and metrics from components in these namespaces in a user cluster: kube-system, gke-system, gke-connect, knative-serving, istio-system, monitoring-system, config-management-system, gatekeeper-system, cnrm-system
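
When estimating which user-cluster workloads fall under the no-charge category, the namespace list above can be encoded as a small check. This is a sketch only: the function name is hypothetical, the list is copied from this page and may differ in other versions, and it applies to user clusters (in an admin cluster, all components are included).

```shell
# Return success if a user-cluster namespace is in the no-charge list above.
# Hypothetical helper; the namespace list is taken from this page.
is_anthos_system_namespace() {
  case "$1" in
    kube-system|gke-system|gke-connect) return 0 ;;
    knative-serving|istio-system|monitoring-system) return 0 ;;
    config-management-system|gatekeeper-system|cnrm-system) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, is_anthos_system_namespace gke-connect succeeds, while is_anthos_system_namespace default fails.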

For more information, see Pricing for Google Cloud's operations suite.

To learn about credits for Cloud Logging metrics, contact sales.

How Prometheus and Grafana for Anthos clusters on VMware work

Each Anthos clusters on VMware cluster is created with Prometheus and Grafana disabled by default. You can follow the installation guide to enable them.

The Prometheus Server is set up in a highly available configuration with two replicas running on two separate nodes. Resource requirements are sized to support clusters of up to five nodes, with each node handling up to 30 Pods that serve custom metrics. Prometheus has a dedicated PersistentVolume with disk space preallocated to hold data for a retention period of four days, plus an added safety buffer.

The admin control plane and each user cluster have their own dedicated monitoring stack, which you can configure independently. Each stack delivers a full set of features: Prometheus Server for monitoring, Grafana for observability, and Prometheus Alertmanager for alerting.

All monitoring endpoints, transferred metric data, and monitoring APIs are secured with Istio components by using mTLS and RBAC rules. Access to monitoring data is restricted to cluster administrators.

Metrics collected by Prometheus

Prometheus collects the following metrics and metadata from the admin control plane and user clusters:

  • Resource usage, such as CPU utilization on Pods and nodes.
  • Kubernetes control plane metrics.
  • Metrics from add-ons and Kubernetes system components running on nodes, such as kubelet.
  • Cluster state, such as health of Pods in a Deployment.
  • Application metrics.
  • Machine metrics, such as network, entropy, and inodes.

Multi-cluster monitoring

The Prometheus and Grafana instance installed on the admin cluster is specially configured to provide insight across the entire Anthos clusters on VMware instance, including the admin cluster and each user cluster. This enables you to:

  • Use a Grafana dashboard to access metrics from all user clusters and admin clusters.
  • View metrics from individual user clusters on Grafana dashboards; the metrics are available for direct queries in full resolution.
  • Access user clusters' node-level and workload metrics for aggregated queries, dashboards, and alerting (workload metrics are limited to workloads running in the kube-system namespace).
  • Configure alerts for specific clusters.
