White-box app monitoring for GKE with Prometheus

This article describes how to use Prometheus and Cloud Monitoring for white-box monitoring of large, complex apps that span many Google Kubernetes Engine (GKE) namespaces, clusters, and Google Cloud projects.

It's challenging to operate in production at scale with many microservices. To address this challenge, site reliability engineers (SREs) rely on data from multiple sources, with various levels of granularity. You can use this data to improve a service for a better user experience. You can also use the data to diagnose issues that might occur with a service—for example, debugging a code deployment. Although GKE can help you deploy microservices architectures, a scalable and dynamic monitoring system is necessary to operate those microservices efficiently and reliably.

With Cloud Operations for GKE, you can use the same tool to do both black-box and white-box monitoring. White-box monitoring of an app involves using metrics and signals from inside your app, such as current memory consumption or request latency, to detect and predict issues. Black-box monitoring uses externally visible behavior as a signal, for example, page load time.

The State of DevOps reports identified capabilities that drive software delivery performance. This article helps you with the capabilities that relate to monitoring.

Prometheus

Prometheus is an open source monitoring and alerting toolkit that was influenced by Borgmon, a Google internal monitoring system. Just as Borg inspired the Kubernetes open source project, Borgmon inspired Prometheus, so Kubernetes and Prometheus work well with each other.

With Prometheus, you can configure scrape targets that are queried (or scraped) at configurable intervals to discover and pull in metrics from fleets of machines. Scrape targets are generally HTTP endpoints exposed by an app that serve metrics in a well-defined text exposition format, with one sample per line. Because HTTP is the default transport for scraping, you can expose metrics from apps written in many different languages and from many kinds of endpoints.
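
As a minimal sketch of a scrape target with a configurable interval, the following Prometheus configuration fragment scrapes a single app endpoint. The job name, host name, and interval values are assumptions made for illustration; they do not come from this article.

```yaml
# prometheus.yml (fragment): a sketch, not a complete configuration.
global:
  # Interval used by every job that does not set its own.
  scrape_interval: 30s

scrape_configs:
  - job_name: demo-app            # hypothetical job name
    # Override the global interval for this job only.
    scrape_interval: 15s
    # Path and scheme shown explicitly; these are the Prometheus defaults.
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets:
          - demo.example.internal:9090   # hypothetical host and port
```

A static target list like this one suits a handful of long-lived endpoints. For fleets that change often, service discovery is usually the better fit.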

Client libraries for popular programming languages are available so you can quickly add metric endpoints to existing apps. Many cloud native tools, such as Istio and etcd, expose Prometheus metrics. Additionally, there are many Prometheus exporters that expose Prometheus metrics for common apps and frameworks, such as Consul, MySQL, and Redis.

In the following architecture diagram, metrics collected from a scrape target are stored in the time-series database of Prometheus.

Diagram: Storing metrics in a time-series database

This database is optimized for the high-cardinality and high-throughput metric use cases that are common for large systems. After the metrics are stored, you can use PromQL, the Prometheus query language, to filter, aggregate, and transform them into a format that is convenient for you. PromQL supports basic arithmetic and comparison operators as well as more complex functions, such as rate calculations over a time window.
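
To illustrate how operators and functions combine in the query language, the following sketch of a Prometheus rules file contains one recording rule and one alerting rule. The metric names (http_requests_total, http_request_duration_seconds_bucket), the group and rule names, and the 0.5-second threshold are assumptions chosen for illustration; substitute the metrics that your apps actually expose.

```yaml
# rules.yml (fragment): a sketch with hypothetical metric names.
groups:
  - name: demo-rules
    rules:
      # Function plus aggregation: per-namespace request rate over 5 minutes.
      - record: namespace:http_requests:rate5m
        expr: sum by (namespace) (rate(http_requests_total[5m]))

      # Function plus comparison operator: fire when the 99th percentile
      # request latency stays above 0.5 seconds for 10 minutes.
      - alert: HighRequestLatency
        expr: >
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: 99th percentile request latency is above 0.5s
```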

Monitoring GKE with Prometheus

In GKE, apps are deployed in groups of containers called pods. Each pod can expose multiple ports, and you can group those pods for load balancing by means of a Kubernetes Service.

When an app uses a Prometheus-compatible library, the library exposes an additional port and endpoint (by default, 9090 and /metrics). You can configure Prometheus to automatically detect apps in GKE by adding a Kubernetes service discovery section (kubernetes_sd_configs) to the scrape configuration. With this section in place, Prometheus queries the Kubernetes API to discover new scrape targets without additional configuration. Kubernetes service discovery supports several discovery mechanisms, and you can configure more than one of them:

  • Pods: A target for each pod.
  • Services: A target for each service IP and port combination.
  • Endpoints: A target for each endpoint resource.
  • Nodes: A target for each GKE node.
  • Ingress: A target for each host in an ingress specification.
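
The discovery mechanisms listed above correspond to the role setting of kubernetes_sd_configs. The following fragment is a sketch that configures two of them side by side. The job names and the namespace names (shop, payments) are assumptions for illustration, and authentication and TLS settings are omitted.

```yaml
# prometheus.yml (fragment): one job per discovery role.
scrape_configs:
  - job_name: k8s-pods            # hypothetical job name
    kubernetes_sd_configs:
      # One target for each pod and declared container port.
      - role: pod

  - job_name: k8s-endpoints       # hypothetical job name
    kubernetes_sd_configs:
      # One target for each endpoint address behind a Service.
      - role: endpoints
        # Optional: limit discovery to specific namespaces.
        namespaces:
          names:
            - shop                # hypothetical namespace
            - payments            # hypothetical namespace
```

Without a namespaces block, discovery covers every namespace that the Prometheus service account is allowed to read.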

When Prometheus discovers targets, it scrapes them to get the raw metric data. Before storing the data, Prometheus adds labels to the metrics based on the information it received from the Kubernetes API. For example, Prometheus can enrich a metric that it scraped from a pod by adding labels that store the namespace where the pod is running, the pod's name, and any labels that you added to the pod.
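
The label enrichment described above is typically expressed as relabeling rules. The following fragment is a sketch under the assumption that targets are discovered with the pod role. The job name and the target label names (namespace, pod) are illustrative choices, not requirements.

```yaml
# prometheus.yml (fragment): copy Kubernetes metadata onto scraped series.
scrape_configs:
  - job_name: kubernetes-pods     # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Store the pod's namespace in a "namespace" label.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      # Store the pod's name in a "pod" label.
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      # Copy every label set on the pod, for example app=demo.
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
```

Labels that start with __meta_ are only available during relabeling and are dropped afterward, so anything you want to keep has to be copied to a regular label as shown.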

After configuring Prometheus with information about your GKE cluster, you need to add an annotation to your Kubernetes resource definitions so that Prometheus begins scraping your services, pods, or ingresses. The annotation is a convention rather than a built-in Prometheus feature: it takes effect because the scrape configuration keeps only the discovered targets that carry it. Usually, prometheus.io/scrape: "true" is used, but you can configure any key.

For example, the following service is scraped on the default port (9090), but at a custom path (/stats):

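apiVersion: v1
kind: Service
metadata:
  annotations:
    # Opt this Service in to scraping by Prometheus.
    prometheus.io/scrape: 'true'
    # Scrape /stats instead of the default /metrics path.
    prometheus.io/path: '/stats'
  labels:
    app: demo
  name: demo
spec:
  ports:
  # Kubernetes requires a name for each port when a Service exposes
  # more than one. The names used here are illustrative.
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 8080
  - name: metrics
    port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: demo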
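
On the Prometheus side, annotations like these only matter if the scrape configuration reads them. The following fragment is a sketch of one way to do that with the endpoints role. The job name is hypothetical, and the rule that keeps only port 9090 is an assumption made to match the demo service, not a general recommendation.

```yaml
# prometheus.yml (fragment): honor the prometheus.io/* Service annotations.
scrape_configs:
  - job_name: kubernetes-service-endpoints   # hypothetical job name
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      # Keep only endpoints whose Service has prometheus.io/scrape: 'true'.
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # If prometheus.io/path is set (for example '/stats'), scrape that path.
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        regex: (.+)
        target_label: __metrics_path__
      # Assumption for this demo: scrape only the metrics port (9090),
      # not the app's serving port (8080).
      - source_labels: [__address__]
        action: keep
        regex: '.+:9090'
```

Because the annotation key appears only in these relabeling rules, you can switch to a different key by changing the source label names.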

See another example of Prometheus configured to scrape GKE.

In the following diagram, Prometheus is configured to scrape both pods and services.

Diagram: Prometheus configured to scrape pods and services

Instrumenting your apps with OpenCensus

As mentioned previously, using Prometheus to monitor your apps requires that your apps expose data on an HTTP endpoint in the Prometheus exposition format. The Prometheus client libraries make adding these endpoints efficient and idiomatic.

OpenCensus is a vendor-agnostic set of libraries that you can use to export metrics and traces from your apps. OpenCensus provides libraries in Java, Go, Node.js, and Python that let you instrument your apps to provide metrics and traces to a variety of backends, including Prometheus. A benefit of instrumenting with OpenCensus is that your app can expose zPages to give an at-a-glance view of the metrics for a server process directly from the server, regardless of the exporter used. zPages are useful during development, or for analyzing a particular process when a problem occurs in production.

If you want to focus on exposing only metrics, you can use functionality from the Prometheus client libraries.

Pairing Cloud Operations for GKE with Prometheus

To deploy a Prometheus server into your cluster, you can use Cloud Operations for GKE with Prometheus support. This server sends metrics to Cloud Monitoring as external metrics. By pairing this functionality with Cloud Monitoring's ability to aggregate metrics across projects, you can manage multiple clusters and projects from one place.

As you add apps and metrics, the new metrics are automatically forwarded to Cloud Monitoring and retained according to the data retention policy. If you use Cloud Monitoring's built-in functionality to centrally configure alerts and uptime checks, you can correlate data and be notified about issues across clusters and regions.

In the following architecture diagram, Prometheus runs in each cluster, enabling you to view the aggregated metrics in Cloud Monitoring.

Diagram: Prometheus running in each cluster, with aggregated metrics viewed in Cloud Monitoring

Bridging existing systems

Like most customers, you probably have existing monitoring systems that store relevant data about your running systems. The cost of migrating your app instrumentation to a new metric format or time-series database increases as the amount of instrumentation code and the number of apps increase. To reduce the complexity of migrating to Cloud Monitoring, you can use Prometheus exporters for other monitoring systems, such as Graphite or StatsD. Your existing system sends its metrics to the exporter, which acts as an intermediary process that Prometheus scrapes. Prometheus then sends the scraped metrics to Cloud Monitoring.
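
From the point of view of Prometheus, the intermediary exporter is just another scrape target. The following fragment is a sketch that assumes a Graphite exporter is already running and receiving metrics from your existing system. The job name, host name, and port are assumptions for illustration; check the address on which your exporter actually serves its metrics endpoint.

```yaml
# prometheus.yml (fragment): scrape an exporter that bridges Graphite metrics.
scrape_configs:
  - job_name: graphite-bridge     # hypothetical job name
    static_configs:
      - targets:
          # Hypothetical host; the port is an assumption about where the
          # exporter serves /metrics in your deployment.
          - graphite-exporter.example.internal:9108
```

The existing apps keep emitting Graphite metrics unchanged, so you can migrate instrumentation app by app instead of all at once.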

The following diagram shows an example of Graphite metrics sent from a set of Compute Engine instances to Cloud Monitoring by way of Prometheus.

Diagram: Graphite metrics sent from a set of Compute Engine instances to Cloud Monitoring

What's next