Workload metrics

If your GKE cluster uses workload metrics, migrate to Managed Service for Prometheus before upgrading the cluster to GKE 1.24, because support for workload metrics is removed in GKE 1.24. If you have already upgraded to GKE 1.24, you must disable workload metrics before making other changes to your cluster.

GKE versions 1.20.8-gke.2100 and later, up to but not including 1.24, offer a fully managed metric-collection pipeline that scrapes Prometheus-style metrics exposed by any GKE workload (such as a CronJob or a Deployment for an application) and sends those metrics to Cloud Monitoring.

All metrics collected by the GKE workload metrics pipeline are ingested into Cloud Monitoring with the prefix workload.googleapis.com.
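
For example, a Prometheus counter named example_requests_total is ingested as the metric type workload.googleapis.com/example_requests_total. To see which workload metric descriptors already exist in your project, one option is the Cloud Monitoring API; the following curl command is a minimal sketch that assumes the Google Cloud CLI is installed and authorized, with [PROJECT_ID] as a placeholder:

# List metric descriptors whose type starts with workload.googleapis.com/
# (the filter is URL-encoded: metric.type = starts_with("workload.googleapis.com/")).
curl -s \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/[PROJECT_ID]/metricDescriptors?filter=metric.type%20%3D%20starts_with(%22workload.googleapis.com/%22)"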

Benefits of GKE workload metrics include:

  • Easy setup: With a single kubectl command to deploy a PodMonitor custom resource, you can start collecting metrics. No manual installation of an agent is required.
  • Highly configurable: Adjust scrape endpoints, scrape frequency, and other parameters.
  • Fully managed: Google maintains the pipeline, so you can focus on your applications.
  • Control costs: Easily manage Cloud Monitoring costs through flexible metric filtering.
  • Open standard: Configure workload metrics using the PodMonitor custom resource, which is modeled after the Prometheus Operator's PodMonitor resource.
  • Better pricing: More intuitive, predictable, and lower pricing.
  • Autopilot support: Supports both GKE Standard and GKE Autopilot clusters.

Requirements

GKE workload metrics requires GKE control plane version 1.20.8-gke.2100 or later and does not support GKE Windows workloads.

If you enable the workload metrics pipeline, then you must also enable the collection of system metrics.

Step 1: Enable the workload metrics pipeline

To enable the workload metrics collection pipeline in an existing GKE cluster, follow these steps:

CONSOLE

  1. In the Google Cloud console, select Kubernetes Engine, and then select Clusters, or click the following button:

    Go to Kubernetes clusters

  2. Click your cluster's name.

  3. In the row labeled "Cloud Monitoring", click the Edit icon.

  4. In the "Edit Cloud Monitoring" dialog box that appears, confirm that Enable Cloud Monitoring is selected.

  5. In the dropdown menu, select Workloads.

  6. Click OK.

  7. Click Save Changes.

GCLOUD

  1. Open a terminal window with the Google Cloud CLI installed. One way to do this is to use Cloud Shell.

  2. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  3. Run this command:

    gcloud beta container clusters update [CLUSTER_ID] \
      --zone=[ZONE] \
      --project=[PROJECT_ID] \
      --monitoring=SYSTEM,WORKLOAD
    

    Including the WORKLOAD value for the --monitoring flag of the gcloud beta container clusters create or gcloud beta container clusters update commands enables the workload metrics collection pipeline.
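
    The same flag works at cluster creation time. For example, the following is a minimal sketch of creating a new cluster with workload metrics enabled (add any other creation flags your cluster needs):

    gcloud beta container clusters create [CLUSTER_ID] \
      --zone=[ZONE] \
      --project=[PROJECT_ID] \
      --monitoring=SYSTEM,WORKLOAD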

Enabling the workload metrics collection pipeline deploys, on each node, a metrics-collection agent that can collect application metrics emitted by Kubernetes workloads.

See Configuring Cloud Operations for GKE for more details about Cloud Monitoring's integration with GKE.

Step 2: Configure which metrics are collected

1) Create a PodMonitor custom resource named my-pod-monitor.yaml:


apiVersion: monitoring.gke.io/v1alpha1
kind: PodMonitor
metadata:
  # POD_MONITOR_NAME is how you identify your PodMonitor
  name: [POD_MONITOR_NAME]
  # POD_NAMESPACE is the namespace where your workload is running, and the
  # namespace of your PodMonitor object
  namespace: [POD_NAMESPACE]
spec:
  # POD_LABEL_KEY and POD_LABEL_VALUE identify which pods to collect metrics
  # from. For example, POD_LABEL_KEY of app.kubernetes.io/name and
  # POD_LABEL_VALUE of mysql would collect metrics from all pods with the label
  # app.kubernetes.io/name=mysql
  selector:
    matchLabels:
      [POD_LABEL_KEY]: [POD_LABEL_VALUE]
  podMetricsEndpoints:
    # CONTAINER_PORT_NAME is the name of the port of the container to be scraped
    # Use the following command to list all ports exposed by the container:
    # kubectl get pod [POD_NAME] -n [POD_NAMESPACE] -o json | jq '.spec.containers[].ports[]?' | grep -v null
    # If the port for your metrics does not have a name, modify your application's pod spec
    # to add a port name.
  - port: [CONTAINER_PORT_NAME]

2) Fetch the cluster credentials with the Google Cloud CLI so that kubectl is configured for your cluster:

gcloud container clusters get-credentials [CLUSTER_ID] --zone=[ZONE]

3) Deploy the PodMonitor custom resource:

kubectl apply -f my-pod-monitor.yaml
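
To confirm that the resource was created, you can list the PodMonitor objects in the namespace (the same resource type is used in the Troubleshooting section later on):

kubectl get podmonitor.monitoring.gke.io -n [POD_NAMESPACE]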

Pricing

Cloud Monitoring charges for the ingestion of GKE workload metrics based on the number of samples ingested. Learn more about Cloud Monitoring pricing.

Managing costs

To manage costs, start by determining which workload metrics are most valuable for your monitoring needs. The GKE workload metrics pipeline provides fine-grained controls to achieve the right trade-off between capturing detailed metrics and keeping costs low.

Many applications expose a wide variety of Prometheus metrics, and by default, the GKE workload metrics pipeline scrapes all metrics from each selected pod every 60 seconds.
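
To see how quickly samples add up, here is a rough, purely illustrative calculation (the pod and time-series counts are hypothetical):

10 pods x 100 time series per pod x 1 sample per minute (60-second interval)
  = 1,000 samples per minute
  = about 43.2 million samples per month (1,000 x 60 x 24 x 30)

Lowering the scrape frequency or dropping unused metrics reduces this total proportionally.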

You can use the following techniques to reduce the cost of metrics:

  1. Adjusting scrape frequency: To lower the cost of metrics, we recommend reducing the scrape frequency where appropriate. For example, a business-relevant KPI might change slowly enough that scraping it every ten minutes is sufficient. In the PodMonitor, set the interval to control the scrape frequency.

  2. Filtering metrics: Identify any metrics that are not being used in Cloud Monitoring and use metricRelabelings to ensure that only metrics useful for dashboards, alerts, or SLOs are sent to Cloud Monitoring.

Here's a concrete example of a PodMonitor custom resource using both techniques:

apiVersion: monitoring.gke.io/v1alpha1
kind: PodMonitor
metadata:
  name: prom-example
  namespace: gke-workload-metrics
spec:
  selector:
    matchLabels:
      app: example
  podMetricsEndpoints:
  - port: metrics
    path: /metrics
    scheme: http

    # (1) scrape metrics less frequently than the default (once every 60s)
    interval: 10m

    metricRelabelings:

    - # (2) drop the irrelevant metric named "foo" and all metrics
      # with the prefix "bar_"
      sourceLabels: [__name__]
      regex: foo|bar_.*
      action: drop

    - # (3) keep only metrics with a subset of values for a particular label
      sourceLabels: [region]
      regex: us-.*
      action: keep

To identify which metrics have the largest number of samples being ingested, use the monitoring.googleapis.com/collection/attribution/write_sample_count metric:

  1. In the Google Cloud console, select Monitoring, and then select Metrics Explorer, or click the following button:

    Go to Metrics Explorer

  2. In the Metric field, select monitoring.googleapis.com/collection/attribution/write_sample_count.

  3. Optionally, filter for only GKE workload metrics:

    • Click Add Filter.

    • In the Label field, select metric_domain.

    • In the Value field, enter workload.googleapis.com.

    • Click Done.

  4. Optionally, group the number of samples ingested by Kubernetes namespace, GKE region, Google Cloud project, or the monitored resource type:

    • Click Group by.

    • To group by Kubernetes namespace, select attribution_dimension and attribution_id.

    • To group by GKE region, select location.

    • To group by Google Cloud project, select resource_container.

    • To group by monitored resource type, select resource_type.

  5. Sort the metrics in descending order by clicking the Value column header above the list.

These steps show the metrics with the highest rate of samples ingested into Cloud Monitoring. Because GKE workload metrics are charged by the number of samples ingested, focus on the metrics with the greatest ingestion rates, and consider whether you can reduce their scrape frequency or stop collecting some of them altogether.
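
You can also retrieve this attribution metric programmatically with the Cloud Monitoring API's timeSeries.list method. The following curl command is a minimal sketch; it assumes the Google Cloud CLI is installed and authorized, and [PROJECT_ID], [START_TIME], and [END_TIME] (RFC 3339 timestamps) are placeholders:

# Samples-ingested attribution, filtered to GKE workload metrics.
curl -s \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --get "https://monitoring.googleapis.com/v3/projects/[PROJECT_ID]/timeSeries" \
  --data-urlencode 'filter=metric.type="monitoring.googleapis.com/collection/attribution/write_sample_count" AND metric.labels.metric_domain="workload.googleapis.com"' \
  --data-urlencode 'interval.startTime=[START_TIME]' \
  --data-urlencode 'interval.endTime=[END_TIME]'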

Finally, there are many resources available to understand the cost of GKE metrics ingested into Cloud Monitoring and to optimize those costs. See the cost optimization guide for additional ways to reduce the cost of Cloud Monitoring.

Earlier GKE versions

To collect application metrics in a cluster whose GKE control plane version is earlier than 1.20.8-gke.2100, use the Stackdriver Prometheus sidecar.

Troubleshooting

If metrics are not available in Cloud Monitoring as expected, use the steps in the following sections to troubleshoot.

Troubleshooting workload metrics

If workload metrics are not available in Cloud Monitoring as expected, here are some steps you can take to troubleshoot the issue.

Confirm your cluster meets minimum requirements

Ensure that your GKE cluster is running control plane version 1.20.8-gke.2100 or later. If not, upgrade your cluster's control plane.
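
One way to check the control plane version from the command line is the describe command used later in this section, limited to the version field:

# Prints the cluster's control plane version.
gcloud container clusters describe [CLUSTER_ID] \
  --zone=[ZONE] \
  --format="value(currentMasterVersion)"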

Confirm workload metrics is enabled

Ensure that your GKE cluster is configured to enable the workload metrics collection pipeline by following these steps:

CONSOLE

  1. In the Google Cloud console, select Kubernetes Engine, and then select Clusters, or click the following button:

    Go to Kubernetes clusters

  2. Click your cluster's name.

  3. In the Details panel for your cluster, confirm that "Workload" is included in the status for Cloud Monitoring. If "Workload" is not shown here, then enable workload metrics.

GCLOUD

  1. Open a terminal window with the Google Cloud CLI installed. One way to do this is to use Cloud Shell.

  2. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  3. Run this command:

    gcloud container clusters describe [CLUSTER_ID] --zone=[ZONE]
    

    In the output of that command, look for the monitoringConfig: line. A couple of lines later, confirm that the enableComponents: section includes WORKLOADS. If WORKLOADS is not shown here, then enable workload metrics.
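
    If you prefer a one-liner instead of scanning the full output, a --format projection can print just that field. This is a sketch that assumes the enableComponents list sits under monitoringConfig.componentConfig, as shown in the describe output:

    # Prints the enabled monitoring components, for example SYSTEM_COMPONENTS;WORKLOADS
    gcloud container clusters describe [CLUSTER_ID] --zone=[ZONE] \
      --format="value(monitoringConfig.componentConfig.enableComponents)"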

Confirm metrics are collected from your application

If you are not able to view your application's metrics in Cloud Monitoring, the steps below can help troubleshoot the issue. These steps use a sample application for demonstration purposes, but you can apply them to troubleshoot any application.

The first few steps deploy a sample application that you can use for testing. The remaining steps show how to troubleshoot why metrics from that application might not be appearing in Cloud Monitoring.

  1. Create a file named prometheus-example-app.yaml containing the following:

    # This example application exposes Prometheus metrics on port 1234.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: prom-example
      name: prom-example
      namespace: gke-workload-metrics
    spec:
      selector:
        matchLabels:
          app: prom-example
      template:
        metadata:
          labels:
            app: prom-example
        spec:
          containers:
          - image: nilebox/prometheus-example-app@sha256:dab60d038c5d6915af5bcbe5f0279a22b95a8c8be254153e22d7cd81b21b84c5
            name: prom-example
            ports:
            - name: metrics-port
              containerPort: 1234
            command:
            - "/main"
            - "--process-metrics"
            - "--go-metrics"
    
  2. Fetch the cluster credentials with the Google Cloud CLI so that kubectl is configured for your cluster:

    gcloud container clusters get-credentials [CLUSTER_ID] --zone=[ZONE]
    
  3. Create the gke-workload-metrics namespace:

    kubectl create ns gke-workload-metrics
    
  4. Deploy the example application:

    kubectl apply -f prometheus-example-app.yaml
    
  5. Confirm the example application is running by running this command:

    kubectl get pod -n gke-workload-metrics -l app=prom-example
    

    The output should look similar to this:

    NAME                            READY   STATUS    RESTARTS   AGE
    prom-example-775d8685f4-ljlkd   1/1     Running   0          25m
    
  6. Confirm your application is exposing metrics as expected

    Using one of the pods returned from the above command, check that the metrics endpoint is working correctly:

    POD_NAME=prom-example-775d8685f4-ljlkd
    NAMESPACE=gke-workload-metrics
    PORT_NUMBER=1234
    METRICS_PATH=/metrics
    kubectl get --raw /api/v1/namespaces/$NAMESPACE/pods/$POD_NAME:$PORT_NUMBER/proxy/$METRICS_PATH
    

    If using the example application above, you should receive output similar to this:

    # HELP example_random_numbers A histogram of normally distributed random numbers.
    # TYPE example_random_numbers histogram
    example_random_numbers_bucket{le="0"} 501
    example_random_numbers_bucket{le="0.1"} 1.75933011554e+11
    example_random_numbers_bucket{le="0.2"} 3.50117676362e+11
    example_random_numbers_bucket{le="0.30000000000000004"} 5.20855682325e+11
    example_random_numbers_bucket{le="0.4"} 6.86550977647e+11
    example_random_numbers_bucket{le="0.5"} 8.45755380226e+11
    example_random_numbers_bucket{le="0.6"} 9.97201199544e+11
    ...
    
  7. Create a file named my-pod-monitor.yaml containing the following:

    # Note that this PodMonitor is in the monitoring.gke.io domain,
    # rather than the monitoring.coreos.com domain used with the
    # Prometheus Operator.  The PodMonitor supports a subset of the
    # fields in the Prometheus Operator's PodMonitor.
    apiVersion: monitoring.gke.io/v1alpha1
    kind: PodMonitor
    metadata:
      name: example
      namespace: gke-workload-metrics
    # spec describes how to monitor a set of pods in a cluster.
    spec:
      # selector determines which pods are monitored.  Required
      # This example matches pods with the `app: prom-example` label
      selector:
        matchLabels:
          app: prom-example
      podMetricsEndpoints:
        # port is the name of the port of the container to be scraped.
      - port: metrics-port
        # path is the path of the endpoint to be scraped.
        # Default /metrics
        path: /metrics
        # scheme is the scheme of the endpoint to be scraped.
        # Default http
        scheme: http
        # interval is the time interval at which metrics should
        # be scraped. Default 60s
        interval: 20s
    
  8. Create this PodMonitor resource:

    kubectl apply -f my-pod-monitor.yaml
    

    After you create the PodMonitor resource, the GKE workload metrics pipeline detects the matching pods and automatically starts scraping them periodically. The pipeline sends the collected metrics to Cloud Monitoring.

  9. Confirm that the label and namespace are set correctly in your PodMonitor. Update the values of NAMESPACE and SELECTOR below to reflect the namespace and matchLabels in your PodMonitor custom resource. Then run this command:

    NAMESPACE=gke-workload-metrics
    SELECTOR=app=prom-example
    kubectl get pods --namespace $NAMESPACE --selector $SELECTOR
    

    You should see a result like this:

    NAME                            READY   STATUS    RESTARTS   AGE
    prom-example-7cff4db5fc-wp8lw   1/1     Running   0          39m
    
  10. Confirm the PodMonitor is in the Ready state.

    Run this command to return all of the PodMonitors you have installed in your cluster:

    kubectl get podmonitor.monitoring.gke.io --all-namespaces
    

    You should see output similar to this:

    NAMESPACE              NAME      AGE
    gke-workload-metrics   example   2m36s
    

    Identify the relevant PodMonitor from the set returned and run this command (replacing example in the command below with the name of your PodMonitor):

    kubectl describe podmonitor.monitoring.gke.io example --namespace gke-workload-metrics
    

    Examine the results returned by kubectl describe and confirm that the "Ready" condition is True. If the Ready condition is False, look for events that indicate why it is not Ready.

  11. Next, confirm that these metrics are being received by Cloud Monitoring. In the Cloud Monitoring section of the Google Cloud console, go to Metrics Explorer.

  12. In the Metric field, type example_requests_total.

  13. In the dropdown menu that appears, select workload.googleapis.com/example_requests_total.

    This example_requests_total metric is one of the Prometheus metrics emitted by the example application.

    If the dropdown menu doesn't appear or if you don't see workload.googleapis.com/example_requests_total in the dropdown menu, try again in a few minutes.

    All metrics are associated with the Kubernetes Container (k8s_container) resource they are collected from. You can use the Resource type field of Metrics Explorer to select k8s_container. You can also group by any labels such as namespace_name or pod_name.

    This metric can be used anywhere within Cloud Monitoring or queried through the Cloud Monitoring API. For example, to add this chart to an existing or new dashboard, click the Save Chart button in the top-right corner and select the desired dashboard in the dialog window.
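
    As a command-line alternative to Metrics Explorer, the same metric can be read through the Cloud Monitoring API's timeSeries.list method. This is a minimal curl sketch; it assumes the example application has been scraped at least once, and [PROJECT_ID], [START_TIME], and [END_TIME] (RFC 3339 timestamps) are placeholders:

    curl -s \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      --get "https://monitoring.googleapis.com/v3/projects/[PROJECT_ID]/timeSeries" \
      --data-urlencode 'filter=metric.type="workload.googleapis.com/example_requests_total" AND resource.type="k8s_container"' \
      --data-urlencode 'interval.startTime=[START_TIME]' \
      --data-urlencode 'interval.endTime=[END_TIME]'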

Check for errors sending to the Cloud Monitoring API

Metrics sent to Cloud Monitoring must stay within the custom-metric limits. Errors appear in the Cloud Monitoring audit logs.

  1. In the Google Cloud console, select Logging, and then select Logs Explorer, or click the following button:

    Go to the Logs Explorer

  2. Run the following query, after replacing PROJECT_ID with your project's ID:

    resource.type="audited_resource"
    resource.labels.service="monitoring.googleapis.com"
    protoPayload.authenticationInfo.principalEmail=~".*-compute@developer.gserviceaccount.com"
    resource.labels.project_id="PROJECT_ID"
    severity>=ERROR
    

Note that this will show errors for all writes to Cloud Monitoring for the project, not just those from your cluster.
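
If you prefer the command line, roughly the same query can be run with gcloud logging read; this sketch looks back one day and caps the number of results (adjust --freshness and --limit as needed):

gcloud logging read \
  'resource.type="audited_resource" AND resource.labels.service="monitoring.googleapis.com" AND protoPayload.authenticationInfo.principalEmail=~".*-compute@developer.gserviceaccount.com" AND severity>=ERROR' \
  --project=[PROJECT_ID] \
  --freshness=1d \
  --limit=10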

Check your Cloud Monitoring ingestion quota

  1. Go to the Cloud Monitoring API Quotas page.
  2. Select the relevant project.
  3. Expand the Time series ingestion requests section.
  4. Confirm that the Quota exceeded errors count for "Time series ingestion requests per minute" is 0. (When the count is 0, the "Quota exceeded errors count" graph shows "No data is available for the selected time frame.")
  5. If the peak usage percentage exceeds 100%, then consider selecting fewer pods, selecting fewer metrics, or requesting a higher limit for the monitoring.googleapis.com/ingestion_requests quota.

Confirm the DaemonSet is deployed

Ensure that the workload metrics DaemonSet is deployed in your cluster. You can validate that it is working as expected by using kubectl:

  1. Fetch the cluster credentials with the Google Cloud CLI so that kubectl is configured for your cluster:

    gcloud container clusters get-credentials [CLUSTER_ID] --zone=[ZONE]
    
  2. Check that the workload-metrics DaemonSet is present and healthy. Run this:

    kubectl get ds -n kube-system workload-metrics
    

    If components were deployed successfully, you will see something similar to the following output:

    NAME               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   AGE
    workload-metrics   3         3         3       3            3           2m
    

    The number of pods above should match the number of Linux GKE nodes in your cluster. For example, a cluster with three Linux nodes has DESIRED = 3. After a few minutes, the READY and AVAILABLE counts should match the DESIRED count; if they don't, there might be an issue with the deployment.
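
    If the counts do not converge, inspecting the individual agent pods can help narrow down the cause. The following is a minimal sketch; it assumes the DaemonSet's pods follow the usual naming pattern with the workload-metrics- prefix:

    # List the agent pods and the nodes they are scheduled on.
    kubectl get pods -n kube-system -o wide | grep '^workload-metrics-'

    # Inspect events and logs for a pod that is not becoming Ready.
    kubectl describe pod [POD_NAME] -n kube-system
    kubectl logs [POD_NAME] -n kube-system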