Observability overview

Observability covers the monitoring, logging, and alerting data that show the status and health of infrastructure and services. The Observability components of the Google Distributed Cloud (GDC) air-gapped appliance collect logs and metrics, which you can view in Grafana dashboards and query to spot operational issues.

Platform Administrators can use the Observability platform to monitor system and user clusters and visualize logs and metrics in the Grafana user interface (UI). Application Operators can collect monitoring and operational data in the form of logs, metrics, and events for their applications.

The Observability platform deploys its stack components in the admin and user clusters. The Grafana instance for Platform Administrators includes organization-level metrics, such as CPU utilization and storage consumption, as well as alerts, logs, and metrics from the operable components of admin, system, and user clusters in GDC.
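
As an illustration of querying a metric such as CPU utilization, the following minimal sketch builds a request URL for the upstream Prometheus HTTP API (GET /api/v1/query) and parses an instant-query response. The base URL, the metric name node_cpu_seconds_total, and the sample response are assumptions made for this example; the endpoints and metric names exposed in your GDC environment may differ, and in practice you would typically run such queries through Grafana.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint: substitute the Prometheus address used in your environment.
PROMETHEUS_URL = "https://prometheus.example.internal"

# Assumed metric name (node_exporter convention); your clusters may expose other metrics.
CPU_UTILIZATION_QUERY = (
    '100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))'
)


def build_instant_query_url(base_url, promql):
    """Return the URL for a Prometheus instant query (GET /api/v1/query)."""
    params = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def extract_samples(response_body):
    """Parse an instant-query JSON response into (labels, value) pairs."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    # Each vector sample carries its value as [timestamp, "value-as-string"].
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


if __name__ == "__main__":
    print(build_instant_query_url(PROMETHEUS_URL, CPU_UTILIZATION_QUERY))

    # Made-up response body that follows the documented instant-query format.
    sample_response = json.dumps({
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"instance": "node-1"}, "value": [1700000000, "37.5"]},
            ],
        },
    })
    for labels, value in extract_samples(sample_response):
        print(labels.get("instance"), value)
```

The sketch only constructs the URL and parses a response; sending the request, authentication, and TLS settings depend on your deployment and are left out on purpose.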

The Grafana instance for Application Operators does not include any default dashboards or logs for your project. Dashboards that you create are visible only when you enable metrics collection for your project.

Platform components

The GDC monitoring and logging stacks include open source services as part of the Observability platform. These services collect logs from Kubernetes Pods, bare metal machines, network switches, and storage appliances.

Review the following table for details on each Observability component.

| Component | Type | Cluster | Description |
| --- | --- | --- | --- |
| anthos-prometheus-k8s | StatefulSet | System only | Prometheus (https://prometheus.io/docs/introduction/overview): A time-series database for collecting and storing metrics and evaluating alerts. It adds labels as key-value pairs and collects metrics from Kubernetes nodes, Pods, bare metal machines, network switches, and storage appliances. The database stores metrics from the user cluster in the same cluster and aggregates metrics from all clusters in the admin cluster. |
| grafana | StatefulSet | System only | Grafana (https://grafana.com/docs/grafana/latest/): A user interface for visualizing dashboards of metrics and alerts. Use it to view metrics that Prometheus collects and to query logs from Loki. |
| alertmanager | StatefulSet | System only | Alertmanager (https://prometheus.io/docs/alerting/latest/alertmanager/): A user-defined manager that sends alerts when logs or metrics indicate that system components are failing or are not operating normally. It manages the routing, silencing, and aggregation of Prometheus alerts. |
| loki | StatefulSet | System only | Loki (https://grafana.com/docs/loki/latest/): A time-series database that stores logs from various components and aggregates logs from all clusters. |
| audit-logs-loki | StatefulSet | System only | Loki: A secondary instance that collects long-term logs necessary for audit purposes. It aggregates logs from all clusters. |
| anthos-log-forwarder | DaemonSet | All clusters | Fluent Bit (https://docs.fluentbit.io/manual): A log processor that gathers logs from various components and locations, processes them, and forwards them to Loki. It runs on every node of all clusters. |
| anthos-audit-logs-forwarder | DaemonSet | All clusters | Fluent Bit: A secondary instance that forwards longer-lived logs for audit purposes. |
| audit-log-failure-detector | DaemonSet | All clusters | A GDC component that detects and reports audit log collection failures. It runs on every node of all clusters. |
| logmon-operator | Deployment | All clusters | The GDC Logmon operator that deploys Observability stack components. |
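
To make the role of Loki in the table more concrete, the following minimal sketch builds a range-query URL for the upstream Loki HTTP API (GET /loki/api/v1/query_range) and flattens the returned log streams into time-ordered lines. The base URL, the namespace label, and the sample response are assumptions for illustration only; the labels that the GDC log forwarders attach to your logs may be different.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_nanos(moment):
    """Convert a timezone-aware datetime to integer nanoseconds since the Unix epoch."""
    delta = moment - EPOCH
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * 1_000_000_000 + delta.microseconds * 1_000


def build_query_range_url(base_url, logql, start, end, limit=100):
    """Return the URL for a Loki range query over [start, end]."""
    params = urlencode({
        "query": logql,
        "start": to_unix_nanos(start),
        "end": to_unix_nanos(end),
        "limit": limit,
    })
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


def extract_lines(response_body):
    """Flatten a 'streams' response into (timestamp_ns, line) pairs, oldest first."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Loki query did not succeed")
    lines = []
    for stream in payload["data"]["result"]:
        # Each entry in "values" is a [nanosecond-timestamp-string, log-line] pair.
        for timestamp_ns, line in stream["values"]:
            lines.append((int(timestamp_ns), line))
    return sorted(lines)


if __name__ == "__main__":
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=1)
    # Hypothetical endpoint and label selector; adjust both for your environment.
    print(build_query_range_url(
        "https://loki.example.internal",
        '{namespace="my-project"} |= "error"',
        start,
        end,
    ))
```

Timestamps are converted with integer arithmetic rather than floating-point multiplication so that the nanosecond values are exact.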

GDC also uses custom resources that GKE Enterprise developed for configuring logging and monitoring. These custom resources let you configure Prometheus scrape targets and alerting rules, Alertmanager configurations, Grafana dashboards, and log scrape targets.
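
The schema of the GDC custom resources is not covered in this overview, so the following sketch does not attempt to reproduce it. Instead, it shows the kind of information an alerting rule carries, using the field names of an upstream Prometheus rule group (name, rules, alert, expr, for, labels, annotations). The group name, alert name, metric, and threshold are hypothetical, and the validation helper is only an illustration, not part of any GDC or Prometheus API.

```python
# Hypothetical rule group in the upstream Prometheus rule-file layout,
# expressed as a Python dictionary. This is NOT the GDC custom resource schema.
EXAMPLE_RULE_GROUP = {
    "name": "example-node-alerts",
    "rules": [
        {
            "alert": "HighCpuUtilization",
            # Assumed metric name (node_exporter convention).
            "expr": (
                "100 * (1 - avg by (instance) "
                '(rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 90'
            ),
            "for": "10m",  # condition must hold this long before the alert fires
            "labels": {"severity": "warning"},
            "annotations": {"summary": "CPU utilization above 90% for 10 minutes"},
        }
    ],
}


def validate_alerting_rule_group(group):
    """Return a list of problems found in an upstream-style alerting rule group."""
    problems = []
    if not group.get("name"):
        problems.append("group is missing a name")
    rules = group.get("rules") or []
    if not rules:
        problems.append("group has no rules")
    for index, rule in enumerate(rules):
        # An alerting rule needs at least an alert name and a PromQL expression.
        for key in ("alert", "expr"):
            if not rule.get(key):
                problems.append(f"rule {index} is missing '{key}'")
    return problems


if __name__ == "__main__":
    issues = validate_alerting_rule_group(EXAMPLE_RULE_GROUP)
    print("no problems found" if not issues else "\n".join(issues))
```

The helper checks only that required fields are present. It does not evaluate the PromQL expression or confirm that the metric exists.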