If your GKE cluster is using workload metrics, migrate to Managed Service for Prometheus before upgrading your cluster to GKE 1.24, because workload metrics support is removed in GKE 1.24. If you have already upgraded to GKE 1.24, you must disable workload metrics before making other changes to your cluster.
GKE versions 1.20.8-gke.2100 and later, but earlier than 1.24, offer a fully managed metric-collection pipeline to scrape Prometheus-style metrics exposed by any GKE workload (such as a CronJob or a Deployment for an application) and send those metrics to Cloud Monitoring.
All metrics collected by the GKE workload metrics pipeline are ingested into Cloud Monitoring with the prefix workload.googleapis.com.
Benefits of GKE workload metrics include:
- Easy setup: With a single kubectl command to deploy a PodMonitor custom resource, you can start collecting metrics. No manual installation of an agent is required.
- Highly configurable: Adjust scrape endpoints, frequency, and other parameters.
- Fully managed: Google maintains the pipeline, so you can focus on your applications.
- Control costs: Easily manage Cloud Monitoring costs through flexible metric filtering.
- Open standard: Configure workload metrics using the PodMonitor custom resource, which is modeled after the Prometheus Operator's PodMonitor resource.
- Better pricing: More intuitive, predictable, and lower pricing.
- Autopilot support: Supports both GKE Standard and GKE Autopilot clusters.
Requirements
GKE workload metrics requires GKE control plane version 1.20.8-gke.2100 or later and does not support GKE Windows workloads.
If you enable the workload metrics pipeline, then you must also enable the collection of system metrics.
Step 1: Enable the workload metrics pipeline
To enable the workload metrics collection pipeline in an existing GKE cluster, follow these steps:
CONSOLE
In the Google Cloud console, select Kubernetes Engine, and then select Clusters.
Click your cluster's name.
In the row labelled "Cloud Monitoring", click the Edit icon.
In the "Edit Cloud Monitoring" dialog box that appears, confirm that Enable Cloud Monitoring is selected.
In the dropdown menu, select Workloads.
Click OK.
Click Save Changes.
GCLOUD
Open a terminal window with Google Cloud CLI installed. One way to do this is to use Cloud Shell.
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
Run this command:
gcloud beta container clusters update [CLUSTER_ID] \
    --zone=[ZONE] \
    --project=[PROJECT_ID] \
    --monitoring=SYSTEM,WORKLOAD
Including the WORKLOAD value for the --monitoring flag of the gcloud beta container clusters create or gcloud beta container clusters update commands enables the workload metrics collection pipeline.
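For example, a minimal sketch of enabling workload metrics at cluster creation time (the cluster name, zone, and project ID are placeholders, as elsewhere on this page):

gcloud beta container clusters create [CLUSTER_ID] \
    --zone=[ZONE] \
    --project=[PROJECT_ID] \
    --monitoring=SYSTEM,WORKLOAD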
Enabling the workload metrics collection pipeline deploys a metrics-collection agent on each node that is capable of collecting application metrics emitted by Kubernetes workloads.
See Configuring Cloud Operations for GKE for more details about Cloud Monitoring's integration with GKE.
Step 2: Configure which metrics are collected
1) Create a PodMonitor custom resource named my-pod-monitor.yaml:
apiVersion: monitoring.gke.io/v1alpha1
kind: PodMonitor
metadata:
  # POD_MONITOR_NAME is how you identify your PodMonitor
  name: [POD_MONITOR_NAME]
  # POD_NAMESPACE is the namespace where your workload is running, and the
  # namespace of your PodMonitor object
  namespace: [POD_NAMESPACE]
spec:
  # POD_LABEL_KEY and POD_LABEL_VALUE identify which pods to collect metrics
  # from. For example, POD_LABEL_KEY of app.kubernetes.io/name and
  # POD_LABEL_VALUE of mysql would collect metrics from all pods with the label
  # app.kubernetes.io/name=mysql
  selector:
    matchLabels:
      [POD_LABEL_KEY]: [POD_LABEL_VALUE]
  podMetricsEndpoints:
    # CONTAINER_PORT_NAME is the name of the port of the container to be scraped.
    # Use the following command to list all ports exposed by the container:
    # kubectl get pod [POD_NAME] -n [POD_NAMESPACE] -o json | jq '.spec.containers[].ports[]?' | grep -v null
    # If the port for your metrics does not have a name, modify your application's
    # pod spec to add a port name.
  - port: [CONTAINER_PORT_NAME]
2) Initialize the credential using the Google Cloud CLI in order to set up kubectl:
gcloud container clusters get-credentials [CLUSTER_ID] --zone=[ZONE]
3) Deploy the PodMonitor custom resource:
kubectl apply -f my-pod-monitor.yaml
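To confirm that the resource was created, you can list the PodMonitor objects in the workload's namespace; this uses the same podmonitor.monitoring.gke.io resource type referenced in the troubleshooting section later on this page:

kubectl get podmonitor.monitoring.gke.io -n [POD_NAMESPACE]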
Pricing
Cloud Monitoring charges for the ingestion of GKE workload metrics based on the number of samples ingested. Learn more about Cloud Monitoring pricing.
Managing costs
To manage costs, start by determining which workload metrics are most valuable for your monitoring needs. The GKE workload metrics pipeline provides fine-grained controls to achieve the right trade-off between capturing detailed metrics and keeping costs low.
Many applications expose a wide variety of Prometheus metrics, and by default, the GKE workload metrics pipeline scrapes all metrics from each selected pod every 60 seconds.
You can use the following techniques to reduce the cost of metrics:
- Adjusting scrape frequency: To lower the cost of metrics, we recommend reducing the scrape frequency when appropriate. For example, a business-relevant KPI might change slowly enough that it can be scraped every ten minutes. In the PodMonitor, set the interval to control scraping frequency.
- Filtering metrics: Identify any metrics that are not being used in Cloud Monitoring and use metricRelabelings to ensure that only metrics useful for dashboards, alerts, or SLOs are sent to Cloud Monitoring.
Here's a concrete example of a PodMonitor custom resource using both techniques:
apiVersion: monitoring.gke.io/v1alpha1
kind: PodMonitor
metadata:
  name: prom-example
  namespace: gke-workload-metrics
spec:
  selector:
    matchLabels:
      app: example
  podMetricsEndpoints:
  - port: metrics
    path: /metrics
    scheme: http
    # (1) scrape metrics less frequently than the default (once every 60s)
    interval: 10m
    metricRelabelings:
    # (2) drop the irrelevant metric named "foo" and all metrics
    # with the prefix "bar_"
    - sourceLabels: [__name__]
      regex: foo|bar_.*
      action: drop
    # (3) keep only metrics with a subset of values for a particular label
    - sourceLabels: [region]
      regex: us-.*
      action: keep
To identify which metrics have the largest number of samples being ingested, use the monitoring.googleapis.com/collection/attribution/write_sample_count metric:
In the Google Cloud console, select Monitoring, and then select Metrics Explorer.
In the Metric field, select monitoring.googleapis.com/collection/attribution/write_sample_count.
Optionally, filter for only GKE workload metrics:
Click Add Filter.
In the Label field, select metric_domain.
In the Value field, enter workload.googleapis.com.
Click Done.
Optionally, group the number of samples ingested by Kubernetes namespace, GKE region, Google Cloud project, or the monitored resource type:
Click Group by.
To group by Kubernetes namespace, select attribution_dimension and attribution_id.
To group by GKE region, select location.
To group by Google Cloud project, select resource_container.
To group by monitored resource type, select resource_type.
Sort the list of metrics in descending order by clicking the column header Value above the list of metrics.
These steps show the metrics with the highest rate of samples ingested into Cloud Monitoring. Since GKE workload metrics are charged by the number of samples ingested, pay attention to metrics with the greatest rate of samples being ingested. Consider whether you can reduce the scrape frequency of any of these metrics or whether you can stop collecting any of them.
Finally, there are many resources available to understand the cost of GKE metrics ingested into Cloud Monitoring and to optimize those costs. See the cost optimization guide for additional ways to reduce the cost of Cloud Monitoring.
Earlier GKE versions
To collect application metrics in a cluster whose GKE control plane version is earlier than 1.20.8-gke.2100, use the Stackdriver Prometheus sidecar.
Troubleshooting
If metrics are not available in Cloud Monitoring as expected, use the steps in the following sections to troubleshoot.
Troubleshooting workload metrics
If workload metrics are not available in Cloud Monitoring as expected, here are some steps you can take to troubleshoot the issue.
Confirm your cluster meets minimum requirements
Ensure that your GKE cluster is running control plane version 1.20.8-gke.2100 or later. If not, upgrade your cluster's control plane.
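For example, the following sketch checks the current control plane version and, if needed, upgrades it with the gcloud CLI; the cluster name, zone, and target version are placeholders:

gcloud container clusters describe [CLUSTER_ID] --zone=[ZONE] \
    --format="value(currentMasterVersion)"
gcloud container clusters upgrade [CLUSTER_ID] --zone=[ZONE] \
    --master --cluster-version=[TARGET_VERSION]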
Confirm workload metrics is enabled
Ensure that your GKE cluster is configured to enable the workload metrics collection pipeline by following these steps:
CONSOLE
In the Google Cloud console, select Kubernetes Engine, and then select Clusters.
Click your cluster's name.
In the Details panel for your cluster, confirm that "Workload" is included in the status for Cloud Monitoring. If "Workload" is not shown here, then enable workload metrics.
GCLOUD
Open a terminal window with gcloud CLI installed. One way to do this is to use Cloud Shell.
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
Run this command:
gcloud container clusters describe [CLUSTER_ID] --zone=[ZONE]
In the output of that command, look for the monitoringConfig: line. A couple of lines later, confirm that the enableComponents: section includes WORKLOADS. If WORKLOADS is not shown there, enable workload metrics.
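As a shortcut, you can print just that section with a format filter; a sketch that assumes the enableComponents list sits under monitoringConfig.componentConfig, as in the describe output above:

gcloud container clusters describe [CLUSTER_ID] --zone=[ZONE] \
    --format="value(monitoringConfig.componentConfig.enableComponents)"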
Confirm metrics are collected from your application
If you are not able to view your application's metrics in Cloud Monitoring, the following steps can help you troubleshoot the issue. These steps use a sample application for demonstration purposes, but you can apply them to any application.
The first few steps deploy the sample application, which you can use for testing. The remaining steps show how to troubleshoot why metrics from that application might not be appearing in Cloud Monitoring.
Create a file named prometheus-example-app.yaml containing the following:

# This example application exposes prometheus metrics on port 1234.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prom-example
  name: prom-example
  namespace: gke-workload-metrics
spec:
  selector:
    matchLabels:
      app: prom-example
  template:
    metadata:
      labels:
        app: prom-example
    spec:
      containers:
      - image: nilebox/prometheus-example-app@sha256:dab60d038c5d6915af5bcbe5f0279a22b95a8c8be254153e22d7cd81b21b84c5
        name: prom-example
        ports:
        - name: metrics-port
          containerPort: 1234
        command:
        - "/main"
        - "--process-metrics"
        - "--go-metrics"
Initialize the credential using the Google Cloud CLI in order to set up kubectl:
gcloud container clusters get-credentials [CLUSTER_ID] --zone=[ZONE]
Create the gke-workload-metrics namespace:
kubectl create ns gke-workload-metrics
Deploy the example application:
kubectl apply -f prometheus-example-app.yaml
Confirm the example application is running by running this command:
kubectl get pod -n gke-workload-metrics -l app=prom-example
The output should look similar to this:
NAME                            READY   STATUS    RESTARTS   AGE
prom-example-775d8685f4-ljlkd   1/1     Running   0          25m
Confirm your application is exposing metrics as expected
Using one of the pods returned from the above command, check that the metrics endpoint is working correctly:
POD_NAME=prom-example-775d8685f4-ljlkd
NAMESPACE=gke-workload-metrics
PORT_NUMBER=1234
METRICS_PATH=/metrics
kubectl get --raw /api/v1/namespaces/$NAMESPACE/pods/$POD_NAME:$PORT_NUMBER/proxy/$METRICS_PATH
If using the example application above, you should receive output similar to this:
# HELP example_random_numbers A histogram of normally distributed random numbers.
# TYPE example_random_numbers histogram
example_random_numbers_bucket{le="0"} 501
example_random_numbers_bucket{le="0.1"} 1.75933011554e+11
example_random_numbers_bucket{le="0.2"} 3.50117676362e+11
example_random_numbers_bucket{le="0.30000000000000004"} 5.20855682325e+11
example_random_numbers_bucket{le="0.4"} 6.86550977647e+11
example_random_numbers_bucket{le="0.5"} 8.45755380226e+11
example_random_numbers_bucket{le="0.6"} 9.97201199544e+11
...
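If the API-server proxy path used above is restricted in your environment, a port-forward provides an equivalent check; a sketch using the same pod, namespace, and port variables defined above:

kubectl port-forward -n $NAMESPACE pod/$POD_NAME $PORT_NUMBER:$PORT_NUMBER &
curl -s http://localhost:$PORT_NUMBER$METRICS_PATH | head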
Create a file named my-pod-monitor.yaml containing the following:

# Note that this PodMonitor is in the monitoring.gke.io domain,
# rather than the monitoring.coreos.com domain used with the
# Prometheus Operator. The PodMonitor supports a subset of the
# fields in the Prometheus Operator's PodMonitor.
apiVersion: monitoring.gke.io/v1alpha1
kind: PodMonitor
metadata:
  name: example
  namespace: gke-workload-metrics
# spec describes how to monitor a set of pods in a cluster.
spec:
  # selector determines which pods are monitored. Required
  # This example matches pods with the `app: prom-example` label
  selector:
    matchLabels:
      app: prom-example
  podMetricsEndpoints:
    # port is the name of the port of the container to be scraped.
  - port: metrics-port
    # path is the path of the endpoint to be scraped.
    # Default /metrics
    path: /metrics
    # scheme is the scheme of the endpoint to be scraped.
    # Default http
    scheme: http
    # interval is the time interval at which metrics should
    # be scraped. Default 60s
    interval: 20s
Create this PodMonitor resource:
kubectl apply -f my-pod-monitor.yaml
Once you have created the PodMonitor resource, the GKE workload metrics pipeline detects the appropriate pods and automatically starts scraping them periodically. The pipeline sends the collected metrics to Cloud Monitoring.
Confirm that the label and namespace are set correctly in your PodMonitor. Update the values of NAMESPACE and SELECTOR below to reflect the namespace and matchLabels in your PodMonitor custom resource, and then run this command:
NAMESPACE=gke-workload-metrics
SELECTOR=app=prom-example
kubectl get pods --namespace $NAMESPACE --selector $SELECTOR
You should see a result like this:
NAME                            READY   STATUS    RESTARTS   AGE
prom-example-7cff4db5fc-wp8lw   1/1     Running   0          39m
Confirm the PodMonitor is in the Ready state.
Run this command to return all of the PodMonitors you have installed in your cluster:
kubectl get podmonitor.monitoring.gke.io --all-namespaces
You should see output similar to this:
NAMESPACE              NAME      AGE
gke-workload-metrics   example   2m36s
Identify the relevant PodMonitor from the set returned, and run this command, replacing example with the name of your PodMonitor:
kubectl describe podmonitor.monitoring.gke.io example --namespace gke-workload-metrics
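The next step looks for the "Ready" condition in that output. As a one-line alternative, you can extract just that condition with jsonpath; a sketch that assumes the PodMonitor status follows the standard Kubernetes condition format:

kubectl get podmonitor.monitoring.gke.io example \
    --namespace gke-workload-metrics \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'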
Examine the results returned by kubectl describe and confirm that the "Ready" condition is True. If Ready is False, look for events that indicate why it isn't Ready.
Next, confirm that these metrics are indeed received by Cloud Monitoring. In the Cloud Monitoring section of the Google Cloud console, go to Metrics Explorer.
In the Metric field, type example_requests_total.
In the dropdown menu that appears, select workload.googleapis.com/example_requests_total.
This example_requests_total metric is one of the Prometheus metrics emitted by the example application.
If the dropdown menu doesn't appear, or if you don't see workload.googleapis.com/example_requests_total in the dropdown menu, try again in a few minutes.
All metrics are associated with the Kubernetes Container (k8s_container) resource they are collected from. You can use the Resource type field of Metrics Explorer to select k8s_container. You can also group by any labels, such as namespace_name or pod_name.
This metric can be used anywhere within Cloud Monitoring or queried via the Cloud Monitoring API. For example, to add this chart to an existing or new dashboard, click the Save Chart button in the top-right corner and select the desired dashboard in the dialog window.
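As an illustration, the same metric can be read programmatically through the Monitoring API's timeSeries.list method; a sketch using curl, an access token from the gcloud CLI, and placeholder values for the project ID and time window:

curl -s -G \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://monitoring.googleapis.com/v3/projects/[PROJECT_ID]/timeSeries" \
    --data-urlencode 'filter=metric.type="workload.googleapis.com/example_requests_total"' \
    --data-urlencode 'interval.startTime=[START_TIME]' \
    --data-urlencode 'interval.endTime=[END_TIME]'

The start and end times must be RFC 3339 timestamps, for example 2021-06-01T00:00:00Z.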
Check for errors sending to the Cloud Monitoring API
Metrics sent to Cloud Monitoring need to stay within the custom metric limits. Errors will show up in the Cloud Monitoring audit logs.
In the Google Cloud console, select Logging, and then select Logs Explorer.
Run the following query, replacing PROJECT_ID with your project's ID:
resource.type="audited_resource"
resource.labels.service="monitoring.googleapis.com"
protoPayload.authenticationInfo.principalEmail=~".*-compute@developer.gserviceaccount.com"
resource.labels.project_id="PROJECT_ID"
severity>=ERROR
Note that this will show errors for all writes to Cloud Monitoring for the project, not just those from your cluster.
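The same filter also works from the command line with gcloud logging read; a sketch that limits output to recent entries:

gcloud logging read '
  resource.type="audited_resource"
  resource.labels.service="monitoring.googleapis.com"
  protoPayload.authenticationInfo.principalEmail=~".*-compute@developer.gserviceaccount.com"
  severity>=ERROR' \
    --project=PROJECT_ID --limit=20 --freshness=1d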
Check your Cloud Monitoring ingestion quota
- Go to the Cloud Monitoring API Quotas page
- Select the relevant project
- Expand the Time series ingestion requests section
- Confirm that the Quota exceeded errors count for "Time series ingestion requests per minute" is 0. (If this is the case, the "Quota exceeded errors count" graph will indicate "No data is available for the selected time frame.")
- If the peak usage percentage exceeds 100%, then consider selecting fewer pods, selecting fewer metrics, or requesting a higher quota limit for the monitoring.googleapis.com/ingestion_requests quota.
Confirm the DaemonSet is deployed
Ensure that the workload metrics DaemonSet is deployed in your cluster. You can validate that it is working as expected by using kubectl:
Initialize the credential using the Google Cloud CLI in order to set up kubectl:
gcloud container clusters get-credentials [CLUSTER_ID] --zone=[ZONE]
Check that the workload-metrics DaemonSet is present and healthy. Run this:
kubectl get ds -n kube-system workload-metrics
If components were deployed successfully, you will see something similar to the following output:
NAME               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   AGE
workload-metrics   3         3         3       3            3           2m
The number of replicas above should match the number of Linux GKE nodes in your cluster. For example, a cluster with 3 nodes will have DESIRED = 3. After a few minutes, the READY and AVAILABLE numbers should match the DESIRED number. Otherwise, there might be an issue with the deployment.
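To compare against the expected count, you can count the Linux nodes in your cluster with a label selector; a short sketch:

kubectl get nodes -l kubernetes.io/os=linux --no-headers | wc -l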