The Prometheus exposition format is a convention used by many applications, especially on Kubernetes, to export their metrics. Prometheus is an open-source monitoring tool widely used to collect, store and query such metrics. We offer a few different ways to integrate Prometheus-style metrics with Cloud Monitoring.
Approach | Support status | Notes |
---|---|---|
Google Cloud Managed Service for Prometheus (Recommended) | Self-deployed version is available in all environments. Managed collection is available in all Kubernetes environments. Managed collection via gcloud is available in GKE versions 1.22 and above. | Offers full compatibility with the open-source ecosystem via PromQL. Priced lower than external metrics. |
Prometheus server with Stackdriver collector | Deprecated. Support depends on Prometheus server version. | Prometheus-style metrics are ingested as external metrics. |
GKE workload metrics | Deprecated. Supported in GKE versions 1.20.8-gke.2100 through 1.23 but will be discontinued from 1.24 onwards. | Not chargeable during deprecation. |
The rest of this page describes how to configure and use Prometheus with Cloud Operations for GKE using the Stackdriver collector.
The source code for the integration is publicly available.
Before you begin
Ensure that you have already created a GKE cluster with Cloud Operations for GKE enabled and installed a Prometheus server.
Prometheus does not provide built-in support for Windows Server. As a workaround, you can deploy the Prometheus server in an additional Linux node pool to capture the Windows metrics and send the metrics back to the Stackdriver collector in the Linux node pool.
Prior to installing the Stackdriver collector, carefully review these requirements; a command-line sketch for spot-checking some of them follows this list:

- You must be running a compatible Prometheus server and have configured it to monitor the applications in your cluster. To learn how to install Prometheus on your cluster, refer to the Prometheus Getting Started guide.
- You must have configured your cluster to use Cloud Operations for GKE. For instructions, see Installing Cloud Operations for GKE.
- You must have the Kubernetes Engine Cluster Admin role for your cluster. For more information, see GKE roles.
- You must ensure that your service account has the proper permissions. For more information, see Use Least Privilege Service Accounts for your Nodes.
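A minimal way to spot-check some of these requirements from the command line is sketched below. The cluster name, zone, and namespace are placeholders; the gcloud fields and kubectl subcommands shown are standard, but adapt the values to your environment:

```
# Confirm that the cluster sends logs and metrics to Cloud Operations.
# Newer clusters report logging.googleapis.com/kubernetes and
# monitoring.googleapis.com/kubernetes for these two fields.
gcloud container clusters describe my-cluster \
    --zone us-central1-a \
    --format="value(loggingService,monitoringService)"

# Confirm that your account can modify workloads in the target namespace.
kubectl auth can-i patch deployments --namespace default
kubectl auth can-i patch statefulsets --namespace default
```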
Installing the collector
To deploy the Stackdriver collector, do the following:
Identify the object to be updated by its name and controller type. Only the controller types `deployment` and `statefulset` are supported.

Set the following environment variables (an example of setting them follows this list):

- `KUBE_NAMESPACE`: Namespace to run the script against.
- `KUBE_CLUSTER`: Sidecar's cluster name parameter.
- `GCP_REGION`: Sidecar's Google Cloud region parameter.
- `GCP_PROJECT`: Sidecar's Google Cloud project parameter.
- `DATA_DIR`: Sidecar's data directory. This is the directory that houses the shared volume that your Prometheus server writes to. In the subsequent instructions, this variable is set to the value `/data`.
- `DATA_VOLUME`: Name of the shared volume in the `DATA_DIR` that contains Prometheus's data. In the subsequent instructions, this variable is set to `data-volume`.
- `SIDECAR_IMAGE_TAG`: Docker image version for the Prometheus sidecar. The latest release can be found in the Container Registry.
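For example, you might set the variables like the following. The namespace, cluster, region, and project values here are placeholders, not values taken from this guide; `/data`, `data-volume`, and the `0.4.3` image tag match the examples used later on this page:

```
export KUBE_NAMESPACE=default          # namespace where your Prometheus server runs (placeholder)
export KUBE_CLUSTER=my-cluster         # placeholder cluster name
export GCP_REGION=us-central1          # placeholder Google Cloud region
export GCP_PROJECT=my-project-id       # placeholder project ID
export DATA_DIR=/data                  # shared data directory used in this guide
export DATA_VOLUME=data-volume         # shared volume name used in this guide
export SIDECAR_IMAGE_TAG=0.4.3         # check Container Registry for the latest release
```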
Execute the following script and supply the two parameters identified in the initial step of this procedure:
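The script itself is not reproduced on this page. As a rough, hypothetical sketch of what it does — assuming it applies a strategic merge patch with `kubectl patch`; the real script in the integration's source repository may differ and also contains the two commented-out lines discussed below — it looks something like this:

```
#!/bin/sh
# Hypothetical sketch only; it mirrors the sidecar configuration shown later
# on this page. Usage: ./patch-sketch.sh <deployment|statefulset> <name>
set -eu

KIND="$1"   # controller type: deployment or statefulset
NAME="$2"   # controller name

kubectl -n "${KUBE_NAMESPACE}" patch "${KIND}" "${NAME}" --type strategic --patch "
spec:
  template:
    spec:
      containers:
      - name: sidecar
        image: gcr.io/stackdriver-prometheus/stackdriver-prometheus-sidecar:${SIDECAR_IMAGE_TAG}
        args:
        - --stackdriver.project-id=${GCP_PROJECT}
        - --prometheus.wal-directory=${DATA_DIR}/wal
        - --stackdriver.kubernetes.location=${GCP_REGION}
        - --stackdriver.kubernetes.cluster-name=${KUBE_CLUSTER}
        ports:
        - name: sidecar
          containerPort: 9091
        volumeMounts:
        - name: ${DATA_VOLUME}
          mountPath: ${DATA_DIR}
"
```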
After successful execution of the script, the Stackdriver collector is added as a sidecar to the pods for the object identified in step one of the procedure.

The two lines in the script that are commented out aren't relevant to the collection of metric data from GKE clusters. However, these two lines are relevant when you want to populate a generic `MonitoredResource`.
There are additional steps you must take to make the configuration changes permanent. These steps are described in subsequent sections.
Validating the installation
To validate the Stackdriver collector installation, run the following command:
kubectl -n "${KUBE_NAMESPACE}" get <deployment|statefulset> <name> -o=go-template='{{$output := "stackdriver-prometheus-sidecar does not exist."}}{{range .spec.template.spec.containers}}{{if eq .name "sidecar"}}{{$output = (print "stackdriver-prometheus-sidecar exists. Image: " .image)}}{{end}}{{end}}{{printf $output}}{{"\n"}}'
When the Prometheus sidecar is successfully installed, the output of the command lists the image used from Container Registry. In the following example, the image version is 0.4.3. In your installation, the version might be different:
stackdriver-prometheus-sidecar exists. Image: gcr.io/stackdriver-prometheus/stackdriver-prometheus-sidecar:0.4.3
Otherwise, the output of the command shows:
stackdriver-prometheus-sidecar does not exist.
To determine if your workload is up-to-date and available, run:
kubectl -n "${KUBE_NAMESPACE}" get <deployment|statefulset> <name>
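For a Deployment that has finished rolling out, the output looks something like the following; the workload name, counts, and age shown here are illustrative:

```
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
prometheus   1/1     1            1           5d
```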
Making the configuration change permanent
After verifying that the collector is successfully installed, update your cluster configuration to make the changes permanent:
Configure the Prometheus server to write to a shared volume. In the following example steps, it is assumed that `DATA_DIR` was set to `/data` and `DATA_VOLUME` was set to `data-volume`:

1. Ensure that there is a shared volume in the Prometheus pod:

       volumes:
         - name: data-volume
           emptyDir: {}

2. Have Prometheus mount the volume under `/data`:

       volumeMounts:
         - name: data-volume
           mountPath: /data

3. Instruct the Prometheus server to write to the shared volume in `/data` by adding the following to its container `args`:

       --storage.tsdb.path=/data
Using the tools you use to manage the configuration of your workloads, re-apply the configuration to the cluster and include the Stackdriver collector container as a sidecar in the new configuration:

    - name: sidecar
      image: gcr.io/stackdriver-prometheus/stackdriver-prometheus-sidecar:[SIDECAR_IMAGE_TAG]
      args:
      - "--stackdriver.project-id=[GCP_PROJECT]"
      - "--prometheus.wal-directory=/data/wal"
      - "--prometheus.api-address=[API_ADDRESS]"
      - "--stackdriver.kubernetes.location=[GCP_REGION]"
      - "--stackdriver.kubernetes.cluster-name=[KUBE_CLUSTER]"
      ports:
      - name: sidecar
        containerPort: 9091
      volumeMounts:
      - name: data-volume
        mountPath: /data

In the previous expression, `[API_ADDRESS]` refers to Prometheus's API address, which is typically `http://127.0.0.1:9090`.
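If you want to confirm that the API address is reachable before depending on the sidecar, one hedged check is to port-forward to the Prometheus pod and query its readiness endpoint; the pod name below is a placeholder:

```
# Port-forward to the Prometheus pod and query its readiness endpoint;
# a ready server returns a short "Ready" message.
kubectl -n "${KUBE_NAMESPACE}" port-forward pod/prometheus-0 9090:9090 &
PF_PID=$!
sleep 2
curl -s http://127.0.0.1:9090/-/ready
kill "${PF_PID}"
```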
For additional configuration details for the collector, refer to the Stackdriver Prometheus sidecar documentation.
Viewing metrics
Prometheus is configured to export metrics to Google Cloud's operations suite as external metrics.
To view these metrics:
In the Google Cloud console, select Monitoring.
In the Monitoring navigation pane, click Metrics Explorer.
In the Find resource type and metric menu:

- Select Kubernetes Container (`k8s_container`) for the Resource type.
- For the Metric field, select one with the prefix `external/prometheus/`. For example, you might select `external/prometheus/go_memstats_alloc_bytes`.
In the following example, a filter was added to display the metrics for a specific cluster. Filtering by cluster name is useful when you have multiple clusters:
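As an alternative to browsing in Metrics Explorer, you can list the Prometheus-derived metric descriptors in your project by calling the Monitoring API directly. A sketch, assuming curl and gcloud are available and MY_PROJECT_ID is a placeholder:

```
# List metric descriptors whose type starts with the Prometheus external prefix.
curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode 'filter=metric.type = starts_with("external.googleapis.com/prometheus/")' \
  "https://monitoring.googleapis.com/v3/projects/MY_PROJECT_ID/metricDescriptors"
```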
Managing costs for Prometheus-derived metrics
Typically, Prometheus is configured to collect all the metrics exported by your application, and, by default, the Stackdriver collector sends these metrics to Cloud Monitoring. This collection includes metrics exported by libraries that your application depends on. For instance, the Prometheus client library exports many metrics about the application environment.
You can configure filters in the Stackdriver collector to select what metrics get ingested into Cloud Monitoring. For example, to import only those metrics generated by `kubernetes-pods` and `kubernetes-service-endpoints`, add the following `--include` statement when starting the stackdriver-prometheus-sidecar:
--include={job=~"kubernetes-pods|kubernetes-service-endpoints"}
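The `--include` flag takes a Prometheus-style series selector, so the same mechanism can also filter on metric names. For example, to keep only metrics whose names begin with a given prefix (the `my_app_` prefix is a placeholder; check the sidecar documentation for the exact selector syntax it supports):

--include={__name__=~"my_app_.*"}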
For more information, see Stackdriver Prometheus sidecar documentation.
You can also estimate how much these metrics contribute to your bill.
Prometheus integration issues
No data shows up in Cloud Monitoring.
If no data shows up in Cloud Monitoring after you went through the installation steps, search the collector logs for error messages. If the logs don't contain any obvious failure messages, turn on debug logging by passing the `--log.level=debug` flag to the collector. You must restart the collector for the logging change to take effect. After restarting the collector, search the collector logs for error messages.

To verify that data is sent to Cloud Monitoring, you can send the requests to files by using the `--stackdriver.store-in-files-directory` command-line parameter and then inspect the files in that directory.
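One way to apply these flags is to add them to the sidecar's `args` in your workload configuration and then restart the collector. A sketch, assuming the sidecar was added to a Deployment named prometheus (a placeholder):

```
# After adding --log.level=debug (and optionally
# --stackdriver.store-in-files-directory=/data/debug, where the path is a
# placeholder under the shared volume) to the sidecar args, restart the
# workload and follow the sidecar logs.
kubectl -n "${KUBE_NAMESPACE}" rollout restart deployment/prometheus
kubectl -n "${KUBE_NAMESPACE}" logs deployment/prometheus -c sidecar --follow
```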
Permission denied
If you see permission-denied errors from the Monitoring API, review the requirements described in Before you begin. Be sure your service accounts have the right permissions. If you use Workload Identity, be sure you have created a relationship between your Kubernetes service accounts (KSAs) and Google service accounts (GSAs).
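For reference, the Workload Identity relationship is typically created by allowing the Kubernetes service account to impersonate a Google service account and then annotating the KSA. A hedged sketch with placeholder names (`my-project`, namespace `prometheus`, KSA `prometheus`, GSA `prom-sidecar`):

```
# Allow the Kubernetes service account to impersonate the Google service account.
gcloud iam service-accounts add-iam-policy-binding \
    prom-sidecar@my-project.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:my-project.svc.id.goog[prometheus/prometheus]"

# Annotate the Kubernetes service account so GKE maps it to the Google service account.
kubectl annotate serviceaccount prometheus \
    --namespace prometheus \
    iam.gke.io/gcp-service-account=prom-sidecar@my-project.iam.gserviceaccount.com

# The Google service account also needs permission to write metric data.
gcloud projects add-iam-policy-binding my-project \
    --role roles/monitoring.metricWriter \
    --member "serviceAccount:prom-sidecar@my-project.iam.gserviceaccount.com"
```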
I'm using recording rules and the metrics don't appear in Cloud Monitoring.
When you are using recording rules, if possible ingest the raw metric into Cloud Monitoring and use Cloud Monitoring's features to aggregate the data when you create a chart or dashboard.
If ingesting the raw metric isn't an option, add a `static_metadata` entry in the collector's config. This option requires you to preserve the `job` and `instance` labels. For instance, the following configuration is valid:
Your Prometheus server configuration:
groups:
- name: my-groups
  rules:
  - record: backlog_avg_10m
    expr: avg_over_time(backlog_k8s[10m])
  - record: backlog_k8s
    expr: sum(total_lag) by (app, job, instance)
Your Prometheus collector configuration:
static_metadata:
- metric: backlog_avg_10m
  type: gauge
Recording rules that change or remove either the `job` or `instance` labels aren't supported.
My metrics are missing the `job` and `instance` Prometheus labels.
The Stackdriver collector for Prometheus constructs a Cloud Monitoring MonitoredResource for your Kubernetes objects from well-known Prometheus labels. When you change the label descriptors, the collector isn't able to write the metrics to Monitoring.
I see "duplicate time series" or "out-of-order writes" errors in the logs.
These errors are caused by writing metric data twice to the same time series. They occur when your Prometheus endpoints use the same metric twice from a single Cloud Monitoring monitored resource.
For example, a Kubernetes container might send Prometheus metrics on multiple ports. Since the Monitoring `k8s_container` monitored resource doesn't differentiate resources based on port, Monitoring detects you are writing two points to the same time series.
To avoid this situation, add a metric label in Prometheus that differentiates the time series. For example, you might use the label `__meta_kubernetes_pod_annotation_prometheus_io_port`, because it remains constant across container restarts.
I see "metric kind must be X, but is Y" errors in the logs.
These errors are caused by changing the Prometheus metric type for an existing metric descriptor. Cloud Monitoring metrics are strictly typed and don't support changing a metric's type between gauge, counter, and others.
To change a metric's type, you must delete the corresponding metric descriptor and create a new one. Deleting a metric descriptor makes the existing time series data inaccessible.
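For reference, a metric descriptor can be deleted through the Monitoring API's projects.metricDescriptors.delete method. A sketch, assuming a Prometheus-derived metric named my_metric in project MY_PROJECT_ID (both placeholders); the slashes in the metric type are URL-encoded in the request path:

```
# Deleting a metric descriptor is irreversible and makes its existing
# time series data inaccessible.
curl -s -X DELETE \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/MY_PROJECT_ID/metricDescriptors/external.googleapis.com%2Fprometheus%2Fmy_metric"
```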
I'm sure I saw Prometheus metric types before, but now I can't find them!
Prometheus is pre-configured to export metrics to Cloud Monitoring as external metrics. When data is exported, Monitoring creates the appropriate metric descriptor for the external metric. If no data of that metric type is written for at least 24 months, the metric descriptor is subject to deletion.
There is no guarantee that unused metric descriptors are deleted after 24 months, but Monitoring reserves the right to delete any Prometheus metric descriptor that hasn't been used in the previous 24 months.
Deprecation policy
The Prometheus integration with Cloud Monitoring is subject to the agents deprecation policy.