Managing GKE metrics

Google Kubernetes Engine (GKE) makes it easy to send metrics to Cloud Monitoring. Once in Cloud Monitoring, metrics can populate custom dashboards, generate alerts, create Service Level Objectives, or be fetched by third-party monitoring services using the Monitoring API.

GKE provides two sources of metrics:

  • System metrics: metrics from essential system components, describing low-level resources such as CPU, memory, and storage.
  • Workload metrics: metrics exposed by any GKE workload (such as a CronJob or a Deployment for an application).

System metrics

When a cluster is created, GKE by default collects certain metrics emitted by system components.

You can choose whether to send metrics from your GKE cluster to Cloud Monitoring. If you choose to send metrics to Cloud Monitoring, you must send system metrics.

All GKE system metrics are ingested into Cloud Monitoring with the prefix kubernetes.io.
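
If you want to inspect these metrics programmatically, one option is the Monitoring API's metricDescriptors.list method. The following is a minimal sketch using curl; the [PROJECT_ID] placeholder and the exact filter are illustrative, not part of GKE's own tooling:

# List Cloud Monitoring metric descriptors whose type starts with the kubernetes.io prefix.
curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/[PROJECT_ID]/metricDescriptors" \
  --data-urlencode 'filter=metric.type = starts_with("kubernetes.io")'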

Configuring collection of system metrics

To enable system metric collection, pass the SYSTEM value to the --monitoring flag of the gcloud container clusters create or gcloud container clusters update commands.

To disable system metric collection, use the NONE value for the --monitoring flag. If system metric collection is disabled, basic information such as CPU usage, memory usage, and disk usage is not available for a cluster in the GKE section of the Cloud Console. Additionally, the Cloud Monitoring GKE Dashboard does not contain information about the cluster.
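
For example, here is a hedged sketch of both settings with gcloud container clusters update; substitute your own cluster name and zone for the placeholders:

# Collect system metrics only:
gcloud container clusters update [CLUSTER_ID] \
  --zone=[ZONE] \
  --monitoring=SYSTEM

# Disable metric collection entirely:
gcloud container clusters update [CLUSTER_ID] \
  --zone=[ZONE] \
  --monitoring=NONE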

See Configuring Cloud Operations for GKE for more details about Cloud Monitoring integration with GKE.

List of system metrics

System metrics include metrics from essential system components important for core Kubernetes functionality. See a complete list of these Kubernetes metrics.

Workload metrics

GKE workload metrics is Google's recommended way to monitor Kubernetes applications using Cloud Monitoring.

GKE 1.20.8-gke.2100 or later offers a fully managed metric-collection pipeline to scrape Prometheus-style metrics exposed by any GKE workload (such as a CronJob or a Deployment for an application) and send those metrics to Cloud Monitoring.

All metrics collected by the GKE workload metrics pipeline are ingested into Cloud Monitoring with the prefix workload.googleapis.com.

Benefits of GKE workload metrics include:

  • Easy setup: You can start collecting metrics by deploying a PodMonitor custom resource with a single kubectl command. No manual installation of an agent is required.
  • Highly configurable: Adjust scrape endpoints, frequency, and other parameters.
  • Fully managed: Google maintains the pipeline, so you can focus on your applications.
  • Control costs: Easily manage Cloud Monitoring costs through flexible metric filtering.
  • Open standard: Configure workload metrics using the PodMonitor custom resource, which is modeled after the Prometheus Operator's PodMonitor resource.
  • HPA support: Compatible with the Stackdriver Custom Metrics Adapter to enable horizontal scaling on custom metrics.
  • Better pricing: More intuitive, predictable, and lower pricing.
  • Autopilot support: Supports both GKE Standard and GKE Autopilot clusters.

Requirements

GKE workload metrics requires GKE control plane version 1.20.8-gke.2100 or later and does not support GKE Windows workloads.

If you enable the workload metrics pipeline, then you must also enable the collection of system metrics.

Step 1: Enable the workload metrics pipeline

To enable the workload metrics collection pipeline in an existing GKE cluster, follow these steps:

CONSOLE

  1. In the Google Cloud Console, go to the list of GKE clusters:

    Go to Kubernetes clusters

  2. Click your cluster's name.

  3. In the row labeled "Cloud Monitoring", click the Edit icon.

  4. In the "Edit Cloud Monitoring" dialog box that appears, confirm that Enable Cloud Monitoring is selected.

  5. In the dropdown menu, select Workloads.

  6. Click OK.

  7. Click Save Changes.

GCLOUD

  1. Open a terminal window with Cloud SDK and the gcloud command-line tool installed. One way to do this is to use Cloud Shell.

  2. In the Cloud Console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Cloud Console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Cloud SDK already installed, including the gcloud command-line tool, and with values already set for your current project. It can take a few seconds for the session to initialize.

  3. Run this command:

    gcloud beta container clusters update [CLUSTER_ID] \
      --zone=[ZONE] \
      --project=[PROJECT_ID] \
      --monitoring=SYSTEM,WORKLOAD
    

    Including the WORKLOAD value for the --monitoring flag of the gcloud beta container clusters create or gcloud beta container clusters update commands enables the workload metrics collection pipeline.
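
    You can also enable workload metrics when creating a new cluster by passing the same flag to gcloud beta container clusters create; for example, with the same placeholders:

    gcloud beta container clusters create [CLUSTER_ID] \
      --zone=[ZONE] \
      --project=[PROJECT_ID] \
      --monitoring=SYSTEM,WORKLOAD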

Enabling the workload metrics collection pipeline deploys a metrics-collection agent on each node that is capable of collecting application metrics emitted by Kubernetes workloads.
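
To confirm that this agent was deployed, you can look for the workload-metrics DaemonSet in the kube-system namespace, the same check described in the Troubleshooting section later on this page:

kubectl get ds -n kube-system workload-metrics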

See Configuring Cloud Operations for GKE for more details about Cloud Monitoring's integration with GKE.

Step 2: Configure which metrics are collected

1) Create a file named my-pod-monitor.yaml containing a PodMonitor custom resource:

apiVersion: monitoring.gke.io/v1alpha1
kind: PodMonitor
metadata:
  # POD_MONITOR_NAME and POD_MONITOR_NAMESPACE identify your PodMonitor
  name: [POD_MONITOR_NAME]
  namespace: [POD_MONITOR_NAMESPACE]
spec:
  # POD_NAMESPACE is the namespace where your workload is running
  namespaceSelector:
    matchNames:
    - [POD_NAMESPACE]
  # POD_LABEL_KEY and POD_LABEL_VALUE identify which pods to collect metrics
  # from. For example, POD_LABEL_KEY of app.kubernetes.io/name and
  # POD_LABEL_VALUE of mysql would collect metrics from all pods with the label
  # app.kubernetes.io/name=mysql
  selector:
    matchLabels:
      [POD_LABEL_KEY]: [POD_LABEL_VALUE]
  podMetricsEndpoints:
    # CONTAINER_PORT_NAME is the name of the port of the container to be scraped
    # Use the following command to list all ports exposed by the container:
    # kubectl get pod [POD_NAME] -n [POD_NAMESPACE] -o json | jq '.spec.containers[].ports[]?' | grep -v null
    # If the port for your metrics does not have a name, modify your application's pod spec
    # to add a port name.
  - port: [CONTAINER_PORT_NAME]

2) Fetch your cluster's credentials with the gcloud command-line tool to set up kubectl:

gcloud container clusters get-credentials [CLUSTER_ID] --zone=[ZONE]

3) Deploy the PodMonitor custom resource:

kubectl apply -f my-pod-monitor.yaml
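
Optionally, verify that the PodMonitor was created and inspect its status; these are the same commands used in the Troubleshooting section later on this page:

kubectl get podmonitor.monitoring.gke.io -n [POD_MONITOR_NAMESPACE]
kubectl describe podmonitor.monitoring.gke.io [POD_MONITOR_NAME] -n [POD_MONITOR_NAMESPACE]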

Migrating

If you are already using the Stackdriver Prometheus sidecar, GKE workload metrics provides numerous improvements. It is strongly recommended that you switch to using GKE workload metrics to collect metrics emitted by your application. See the guide for migrating from the Stackdriver Prometheus Sidecar to GKE workload metrics.

Managing costs

Learn about Cloud Monitoring pricing, including which metrics are non-chargeable (free).

To manage costs, start by determining which workload metrics are most valuable for your monitoring needs. The GKE workload metrics pipeline provides fine-grained controls to achieve the right trade-off between capturing detailed metrics and keeping costs low.

Many applications expose a wide variety of Prometheus metrics, and by default, the GKE workload metrics pipeline scrapes all metrics from each selected pod every 60 seconds.

You can use the following techniques to reduce the cost of metrics:

  1. Adjusting scrape frequency: To lower the cost of metrics, we recommend reducing the scrape frequency when appropriate. For example, a business-relevant KPI might change slowly enough that it can be scraped every ten minutes. In the PodMonitor, set the interval field to control scraping frequency.

  2. Filtering metrics: Identify any metrics that are not being used in Cloud Monitoring and use metricRelabelings to ensure that only metrics useful for dashboards, alerts, or SLOs are sent to Cloud Monitoring.

Here's a concrete example of a PodMonitor custom resource using both techniques:

apiVersion: monitoring.gke.io/v1alpha1
kind: PodMonitor
metadata:
  name: prom-example
  namespace: gke-workload-metrics
spec:
  selector:
    matchLabels:
      app: example
  podMetricsEndpoints:
  - port: metrics
    path: /metrics
    scheme: http

    # (1) scrape metrics less frequently than the default (once every 60s)
    interval: 10m

    metricRelabelings:

    - # (2) drop the irrelevant metric named "foo" and all metrics
      # with the prefix "bar_"
      sourceLabels: [__name__]
      regex: foo|bar_.*
      action: drop

    - # (3) keep only metrics with a subset of values for a particular label
      sourceLabels: [region]
      regex: us-.*
      action: keep

Finally, there are many resources available to understand the cost of GKE metrics ingested into Cloud Monitoring and to optimize those costs. See the cost optimization guide for additional ways to reduce the cost of Cloud Monitoring.

Earlier GKE versions

To collect application metrics in a cluster with a GKE control plane version earlier than 1.20.8-gke.2100, use the Stackdriver Prometheus sidecar.

Troubleshooting

If workload metrics are not available in Cloud Monitoring as expected, here are some steps you can take to troubleshoot the issue.

Confirm your cluster meets minimum requirements

Ensure that your GKE cluster is running control plane version 1.20.8-gke.2100 or later. If not, upgrade your cluster's control plane.
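
You can check the control plane version from the command line; a minimal sketch, assuming the currentMasterVersion field of the cluster resource:

gcloud container clusters describe [CLUSTER_ID] \
  --zone=[ZONE] \
  --format="value(currentMasterVersion)"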

Confirm workload metrics is enabled

Ensure that your GKE cluster is configured to enable the workload metrics collection pipeline by following these steps:

CONSOLE

  1. In the Google Cloud Console, go to the list of GKE clusters:

    Go to Kubernetes clusters

  2. Click your cluster's name.

  3. In the Details panel for your cluster, confirm that "Workload" is included in the status for Cloud Monitoring. If "Workload" is not shown here, then enable workload metrics.

GCLOUD

  1. Open a terminal window with Cloud SDK and gcloud installed. One way to do this is to use Cloud Shell.

  2. In the Cloud Console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Cloud Console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Cloud SDK already installed, including the gcloud command-line tool, and with values already set for your current project. It can take a few seconds for the session to initialize.

  3. Run this command:

    gcloud container clusters describe [CLUSTER_ID] --zone=[ZONE]
    

    In the output of that command, look for the monitoringConfig: section and confirm that the enableComponents: list includes WORKLOADS. If WORKLOADS is not shown there, then enable workload metrics.
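
    To print just the enabled components, you can add a format expression; a minimal sketch, assuming the monitoringConfig.componentConfig.enableComponents field path in the describe output:

    gcloud container clusters describe [CLUSTER_ID] \
      --zone=[ZONE] \
      --format="value(monitoringConfig.componentConfig.enableComponents)"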

Confirm metrics are collected from your application

If you are not able to view your application's metrics in Cloud Monitoring, the steps below can help you troubleshoot the issue. These steps use a sample application for demonstration purposes, but you can apply the same steps to troubleshoot any application.

The first few steps deploy a sample application, which you can use for testing. The remaining steps show how to troubleshoot why metrics from that application might not be appearing in Cloud Monitoring.

  1. Create a file named prometheus-example-app.yaml containing the following:

    # This example application exposes prometheus metrics on port 1234.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: prom-example
      name: prom-example
      namespace: gke-workload-metrics
    spec:
      selector:
        matchLabels:
          app: prom-example
      template:
        metadata:
          labels:
            app: prom-example
        spec:
          containers:
          - image: nilebox/prometheus-example-app@sha256:dab60d038c5d6915af5bcbe5f0279a22b95a8c8be254153e22d7cd81b21b84c5
            name: prom-example
            ports:
            - name: metrics-port
              containerPort: 1234
            command:
            - "/main"
            - "--process-metrics"
            - "--go-metrics"
    
  2. Fetch your cluster's credentials with the gcloud command-line tool to set up kubectl:

    gcloud container clusters get-credentials [CLUSTER_ID] --zone=[ZONE]
    
  3. Create the gke-workload-metrics namespace:

    kubectl create ns gke-workload-metrics
    
  4. Deploy the example application:

    kubectl apply -f prometheus-example-app.yaml
    
  5. Confirm the example application is running by running this command:

    kubectl get pod -n gke-workload-metrics -l app=prom-example
    

    The output should look similar to this:

    NAME                            READY   STATUS    RESTARTS   AGE
    prom-example-775d8685f4-ljlkd   1/1     Running   0          25m
    
  6. Confirm your application is exposing metrics as expected

    Using one of the pods returned from the above command, check that the metrics endpoint is working correctly:

    POD_NAME=prom-example-775d8685f4-ljlkd
    NAMESPACE=gke-workload-metrics
    PORT_NUMBER=1234
    METRICS_PATH=/metrics
    kubectl get --raw /api/v1/namespaces/$NAMESPACE/pods/$POD_NAME:$PORT_NUMBER/proxy${METRICS_PATH}
    

    If using the example application above, you should receive output similar to this:

    # HELP example_random_numbers A histogram of normally distributed random numbers.
    # TYPE example_random_numbers histogram
    example_random_numbers_bucket{le="0"} 501
    example_random_numbers_bucket{le="0.1"} 1.75933011554e+11
    example_random_numbers_bucket{le="0.2"} 3.50117676362e+11
    example_random_numbers_bucket{le="0.30000000000000004"} 5.20855682325e+11
    example_random_numbers_bucket{le="0.4"} 6.86550977647e+11
    example_random_numbers_bucket{le="0.5"} 8.45755380226e+11
    example_random_numbers_bucket{le="0.6"} 9.97201199544e+11
    ...
    
  7. Create a file named my-pod-monitor.yaml containing the following:

    # Note that this PodMonitor is in the monitoring.gke.io domain,
    # rather than the monitoring.coreos.com domain used with the
    # Prometheus Operator.  The PodMonitor supports a subset of the
    # fields in the Prometheus Operator's PodMonitor.
    apiVersion: monitoring.gke.io/v1alpha1
    kind: PodMonitor
    metadata:
      name: example
    # spec describes how to monitor a set of pods in a cluster.
    spec:
      # namespaceSelector determines which namespace is searched for pods. Required
      namespaceSelector:
        matchNames:
        - gke-workload-metrics
      # selector determines which pods are monitored.  Required
      # This example matches pods with the `app: prom-example` label
      selector:
        matchLabels:
          app: prom-example
      podMetricsEndpoints:
        # port is the name of the port of the container to be scraped.
      - port: metrics-port
        # path is the path of the endpoint to be scraped.
        # Default /metrics
        path: /metrics
        # scheme is the scheme of the endpoint to be scraped.
        # Default http
        scheme: http
        # interval is the time interval at which metrics should
        # be scraped. Default 60s
        interval: 20s
    
  8. Create this PodMonitor resource:

    kubectl apply -f my-pod-monitor.yaml
    

    Once you have created the PodMonitor resource, the GKE workload metrics pipeline will detect the appropriate pods and will automatically start scraping them periodically. The pipeline will send collected metrics to Cloud Monitoring.

  9. Confirm the label and namespace selectors are set correctly in your PodMonitor. Update the values of NAMESPACE and SELECTOR below to reflect the namespaceSelector and matchLabels in your PodMonitor custom resource. Then run this command:

    NAMESPACE=gke-workload-metrics
    SELECTOR=app=prom-example
    kubectl get pods --namespace $NAMESPACE --selector $SELECTOR
    

    You should see a result like this:

    NAME                            READY   STATUS    RESTARTS   AGE
    prom-example-7cff4db5fc-wp8lw   1/1     Running   0          39m
    
  10. Confirm the PodMonitor is in the Ready state.

    Run this command to return all of the PodMonitors you have installed in your cluster:

    kubectl get podmonitor.monitoring.gke.io --all-namespaces
    

    You should see output similar to this:

    NAMESPACE   NAME      AGE
    default     example   2m36s
    

    Identify the relevant PodMonitor from the set returned and run this command (replacing example in the command below with the name of your PodMonitor):

    kubectl describe podmonitor.monitoring.gke.io example --namespace default
    

    Examine the results returned by kubectl describe and confirm that the "Ready" condition is True. If the Ready condition is False, look for events that indicate why it isn't Ready.
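
    To list the events associated with a PodMonitor, you can use a field selector; a minimal sketch (adjust the namespace and name to match your PodMonitor):

    kubectl get events --namespace default \
      --field-selector involvedObject.kind=PodMonitor,involvedObject.name=example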

  11. Next, we'll confirm these metrics are indeed received by Cloud Monitoring. In the Cloud Monitoring section of the Google Cloud Console, go to Metrics explorer.

  12. In the Metric field, type example_requests_total.

  13. In the dropdown menu that appears, select workload.googleapis.com/example_requests_total.

    This example_requests_total metric is one of the Prometheus metrics emitted by the example application.

    If the dropdown menu doesn't appear or if you don't see workload.googleapis.com/example_requests_total in the dropdown menu, try again in a few minutes.

    All metrics are associated with the Kubernetes Container (k8s_container) resource they are collected from. You can use the Resource type field of Metrics Explorer to select k8s_container. You can also group by any labels such as namespace_name or pod_name.

    This metric can be used anywhere within Cloud Monitoring or queried via the Cloud Monitoring API. For example, to add this chart to an existing or new dashboard, click on the Save Chart button in the top right corner and select the desired dashboard in a dialog window.
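
    For example, the metric can be read through the Cloud Monitoring API's timeSeries.list method. The following is a minimal sketch using curl; the ten-minute window, the GNU date syntax (as used in Cloud Shell), and the [PROJECT_ID] placeholder are assumptions to adapt:

    curl -s -G \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      "https://monitoring.googleapis.com/v3/projects/[PROJECT_ID]/timeSeries" \
      --data-urlencode 'filter=metric.type="workload.googleapis.com/example_requests_total"' \
      --data-urlencode "interval.startTime=$(date -u -d '-10 minutes' +%Y-%m-%dT%H:%M:%SZ)" \
      --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"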

Check for errors sending to the Cloud Monitoring API

Metrics sent to Cloud Monitoring need to stay within the custom metric limits. Errors will show up in the Cloud Monitoring audit logs.

Using the Cloud Logging Logs Explorer, look in the logs with this log filter (replace [PROJECT_ID] with your project ID):

resource.type="audited_resource"
resource.labels.service="monitoring.googleapis.com"
protoPayload.authenticationInfo.principalEmail=~".*-compute@developer.gserviceaccount.com"
resource.labels.project_id="[PROJECT_ID]"
severity>=ERROR

Note that this will show errors for all writes to Cloud Monitoring for the project, not just those from your cluster.
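
You can run an equivalent query from the command line with gcloud logging read; a minimal sketch (the time window and limit shown are illustrative):

gcloud logging read '
  resource.type="audited_resource"
  resource.labels.service="monitoring.googleapis.com"
  protoPayload.authenticationInfo.principalEmail=~".*-compute@developer.gserviceaccount.com"
  severity>=ERROR' \
  --project=[PROJECT_ID] --freshness=1d --limit=20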

Check your Cloud Monitoring ingestion quota

  1. Go to the Cloud Monitoring API Quotas page.
  2. Select the relevant project.
  3. Expand the Time series ingestion requests section.
  4. Confirm that the Quota exceeded errors count for "Time series ingestion requests per minute" is 0. (If this is the case, the "Quota exceeded errors count" graph will indicate "No data is available for the selected time frame.")
  5. If the peak usage percentage exceeds 100%, then consider selecting fewer pods, selecting fewer metrics, or requesting a higher quota limit for the monitoring.googleapis.com/ingestion_requests quota.

Confirm the DaemonSet is deployed

Ensure that the workload metrics DaemonSet is deployed in your cluster. You can validate that it is working as expected by using kubectl:

  1. Fetch your cluster's credentials with the gcloud command-line tool to set up kubectl:

    gcloud container clusters get-credentials [CLUSTER_ID] --zone=[ZONE]
    
  2. Check that the workload-metrics DaemonSet is present and healthy. Run this:

    kubectl get ds -n kube-system workload-metrics
    

    If components were deployed successfully, you will see something similar to the following output:

    NAME               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   AGE
    workload-metrics   3         3         3       3            3           2m
    

    The number of replicas above should match the number of Linux GKE nodes in your cluster. For example, a cluster with 3 nodes will have DESIRED = 3. After a few minutes, READY and AVAILABLE numbers should match the DESIRED number. Otherwise, there might be an issue with deployment.
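
    To compare against your cluster, you can count the Linux nodes directly using the standard kubernetes.io/os node label; a minimal sketch:

    kubectl get nodes -l kubernetes.io/os=linux --no-headers | wc -l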

Other metrics

Besides system metrics and workload metrics described above, some additional metrics available for a GKE cluster include: