Managing GKE metrics

Google Kubernetes Engine (GKE) makes it easy to send metrics to Cloud Monitoring. Once in Cloud Monitoring, metrics can populate custom dashboards, generate alerts, create service-level objectives, or be fetched by third-party monitoring services using the Monitoring API.

GKE provides several sources of metrics:

  • System metrics: metrics from essential system components, describing low-level resources such as CPU, memory, and storage.
  • Managed Service for Prometheus: lets you monitor and alert on your workloads, using Prometheus, without having to manually manage and operate Prometheus at scale.
  • Control plane metrics: metrics exported from certain control plane components such as the API server and scheduler.
  • Workload metrics (Deprecated): metrics exposed by any GKE workload (such as a CronJob or a Deployment for an application).

System metrics

When a cluster is created, GKE collects certain metrics emitted by system components by default.

You can choose whether to send metrics from your GKE cluster to Cloud Monitoring. If you choose to send metrics to Cloud Monitoring, you must send system metrics.

All GKE system metrics are ingested into Cloud Monitoring with the prefix kubernetes.io.

Pricing

Cloud Monitoring does not charge for the ingestion of GKE system metrics. Learn more about Cloud Monitoring pricing.

Configuring collection of system metrics

To enable system metric collection, pass the SYSTEM value to the --monitoring flag of the gcloud container clusters create or gcloud container clusters update commands.

To disable system metric collection, use the NONE value for the --monitoring flag. If system metric collection is disabled, basic information such as CPU usage, memory usage, and disk usage is not available for a cluster in the GKE section of the Google Cloud console, and the Cloud Monitoring GKE Dashboard does not contain information about the cluster.
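
For example, here is a minimal sketch of both settings on an existing cluster; the cluster ID and zone values are placeholders:

gcloud container clusters update [CLUSTER_ID] \
  --zone=[ZONE] \
  --monitoring=SYSTEM

# To turn off metric collection entirely instead:
gcloud container clusters update [CLUSTER_ID] \
  --zone=[ZONE] \
  --monitoring=NONE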

See Configuring Cloud Operations for GKE for more details about Cloud Monitoring integration with GKE.

List of system metrics

System metrics include metrics from essential system components important for core Kubernetes functionality. See a complete list of system metrics.

Control plane metrics

You can configure a GKE cluster to send certain metrics emitted by the Kubernetes API server, Scheduler, and Controller Manager to Cloud Monitoring.

Requirements

Sending metrics emitted by Kubernetes control plane components to Cloud Monitoring requires GKE control plane version 1.23.6 or later and requires that the collection of system metrics be enabled.
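
As a quick check, the following sketch prints the cluster's control plane version (it assumes the currentMasterVersion field returned by gcloud container clusters describe; the cluster ID and zone are placeholders):

gcloud container clusters describe [CLUSTER_ID] \
  --zone=[ZONE] \
  --format="value(currentMasterVersion)"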

Configuring collection of control plane metrics

To enable Kubernetes control plane metrics in an existing GKE cluster, follow these steps:

CONSOLE

  1. In the Google Cloud console, go to the list of GKE clusters:

    Go to Kubernetes clusters

  2. Click your cluster's name.

  3. In the row labeled Cloud Monitoring, click the Edit icon.

  4. In the Edit Cloud Monitoring dialog box that appears, confirm that Enable Cloud Monitoring is selected.

  5. In the Components dropdown menu, select the control plane components from which you would like to collect metrics: API Server, Scheduler, or Controller Manager.

  6. Click OK.

  7. Click Save Changes.

GCLOUD

  1. Open a terminal window with the Google Cloud CLI installed. One way to do this is to use Cloud Shell.

  2. In the console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  3. Pass one or more of the values API_SERVER, SCHEDULER, or CONTROLLER_MANAGER to the --monitoring flag of the gcloud container clusters create or gcloud container clusters update commands.

    For example, to collect metrics from the API server, scheduler, and controller manager, run this command:

    gcloud container clusters update [CLUSTER_ID] \
      --zone=[ZONE] \
      --project=[PROJECT_ID] \
      --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER
    

Metric format

All Kubernetes control plane metrics written to Cloud Monitoring use the resource type prometheus_target. Each metric name is prefixed with prometheus.googleapis.com/ and has a suffix indicating the PromQL metric type, such as /gauge, /histogram, or /counter. Otherwise, each metric name is identical to the metric name exposed by open source Kubernetes.
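
For example, the open source gauge apiserver_storage_objects is written to Cloud Monitoring as prometheus.googleapis.com/apiserver_storage_objects/gauge against the prometheus_target resource type.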

Pricing

GKE control plane metrics use Google Cloud Managed Service for Prometheus to ingest metrics into Cloud Monitoring. Cloud Monitoring charges for the ingestion of GKE control plane metrics based on the number of samples ingested. Learn about Cloud Monitoring pricing.

Understanding your Monitoring bill

To identify which control plane metrics have the largest number of samples being ingested, use the monitoring.googleapis.com/collection/attribution/write_sample_count metric:

  1. In the console, select Monitoring:

    Go to Monitoring

  2. In the Monitoring navigation pane, click Metrics Explorer.

  3. In the Metric field, select monitoring.googleapis.com/collection/attribution/write_sample_count.

  4. Click Add Filter.

  5. In the Label field, select attribution_dimension.

  6. In the Comparison field, select = (equals).

  7. In the Value field, enter cluster.

  8. Click Done.

  9. Optionally, filter for only certain metrics. Because API server metric names all include "apiserver" and Scheduler metric names all include "scheduler", you can restrict the results to metrics containing those strings:

    • Click Add Filter.

    • In the Label field, select metric_type.

    • In the Comparison field, select =~ (equals regex).

    • In the Value field, enter .*apiserver.* or .*scheduler.*.

    • Click Done.

  10. Optionally, group the number of samples ingested by GKE region or project:

    • Click Group by.

    • Ensure metric_type is selected.

    • To group by GKE region, select location.

    • To group by project, select project_id.

    • Click OK.

  11. Optionally, group the number of samples ingested by GKE cluster name:

    • Click Group by.

    • To group by GKE cluster name, ensure both attribution_dimension and attribution_id are selected.

    • Click OK.

  12. Sort the list of metrics in descending order by clicking the column header Value above the list of metrics.

These steps show the metrics with the highest rate of samples ingested into Cloud Monitoring. Since GKE control plane metrics are charged by the number of samples ingested, pay attention to metrics with the greatest rate of samples being ingested.
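
If you prefer to run this query outside the console, the following is an illustrative sketch of the equivalent Cloud Monitoring filter; it assumes that attribution_dimension is exposed as a metric label, as the steps above suggest:

metric.type="monitoring.googleapis.com/collection/attribution/write_sample_count" AND
metric.labels.attribution_dimension="cluster"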

Exporting from Cloud Monitoring

Control plane metrics can be exported from Cloud Monitoring by using the Cloud Monitoring API. Because all control plane metrics are ingested using Managed Service for Prometheus, they can be queried using PromQL or MQL.
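
For example, the following is a minimal sketch, not taken from this document, that reads one control plane metric through the timeSeries.list method of the Cloud Monitoring API; the project ID, metric, and time window are placeholders:

curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/[PROJECT_ID]/timeSeries" \
  --data-urlencode 'filter=metric.type="prometheus.googleapis.com/apiserver_request_total/counter"' \
  --data-urlencode "interval.startTime=2024-01-01T00:00:00Z" \
  --data-urlencode "interval.endTime=2024-01-01T01:00:00Z"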

Quota

Control plane metrics consume the "Time series ingestion requests per minute" quota of the Cloud Monitoring API. Before enabling control plane metrics, you may want to check your recent peak usage of that quota. If you have many clusters in the same project or are already approaching that quota's limit, you may want to request a quota limit increase before enabling control plane metrics.

Querying metrics

When you query control plane metrics, the name you use depends on whether you are using PromQL or Cloud Monitoring features like Metrics Explorer, MQL, or dashboards. The lists in the API server metrics, Scheduler metrics, and Controller Manager metrics sections below show two versions of each metric name:

  • PromQL metric name: When using PromQL in the Managed Prometheus page of the console or in the Cloud Monitoring API, use the PromQL metric name.
  • Cloud Monitoring metric name: When using other Monitoring features, use the Cloud Monitoring metric name shown in the lists below. This name must be prefixed with prometheus.googleapis.com/, which has been omitted from the entries. See the example after this list.
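
For example (an illustrative query, not part of the metric lists below), you could chart the API server request rate by verb with the PromQL name:

sum by (verb) (rate(apiserver_request_total[5m]))

In Metrics Explorer, MQL, or dashboards, the same data is selected with the Cloud Monitoring name prometheus.googleapis.com/apiserver_request_total/counter.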

API server metrics

When API server metrics are enabled, all of the metrics listed below are exported to Cloud Monitoring in the same project as the GKE cluster.

The Cloud Monitoring metric names in this list must be prefixed with prometheus.googleapis.com/. That prefix has been omitted from the entries.

Each entry shows the PromQL metric name and its launch stage, followed by the Cloud Monitoring metric name; kind, type, and unit; monitored resources; required GKE version; description; and labels.

  • apiserver_current_inflight_requests (GA)
    Cloud Monitoring metric name: apiserver_current_inflight_requests/gauge
    Kind, Type, Unit: Gauge, Double, 1
    Monitored resources: prometheus_target
    Required GKE version: 1.23.6+
    Description: Maximal number of currently used inflight request limit of this apiserver per request kind in last second.
    Labels: request_kind

  • apiserver_request_duration_seconds (GA)
    Cloud Monitoring metric name: apiserver_request_duration_seconds/histogram
    Kind, Type, Unit: Cumulative, Distribution, s
    Monitored resources: prometheus_target
    Required GKE version: 1.23.6+
    Description: Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component.
    Labels: component, dry_run, group, resource, scope, subresource, verb, version

  • apiserver_request_total (GA)
    Cloud Monitoring metric name: apiserver_request_total/counter
    Kind, Type, Unit: Cumulative, Double, 1
    Monitored resources: prometheus_target
    Required GKE version: 1.23.6+
    Description: Counter of apiserver requests broken out for each verb, dry run value, group, version, resource, scope, component, and HTTP response code.
    Labels: code, component, dry_run, group, resource, scope, subresource, verb, version

  • apiserver_response_sizes (GA)
    Cloud Monitoring metric name: apiserver_response_sizes/histogram
    Kind, Type, Unit: Cumulative, Distribution, 1
    Monitored resources: prometheus_target
    Required GKE version: 1.23.6+
    Description: Response size distribution in bytes for each group, version, verb, resource, subresource, scope and component.
    Labels: component, group, resource, scope, subresource, verb, version

  • apiserver_storage_objects (GA)
    Cloud Monitoring metric name: apiserver_storage_objects/gauge
    Kind, Type, Unit: Gauge, Double, 1
    Monitored resources: prometheus_target
    Required GKE version: 1.23.6+
    Description: Number of stored objects at the time of last check split by kind.
    Labels: resource

  • apiserver_admission_controller_admission_duration_seconds (GA)
    Cloud Monitoring metric name: apiserver_admission_controller_admission_duration_seconds/histogram
    Kind, Type, Unit: Cumulative, Distribution, s
    Monitored resources: prometheus_target
    Required GKE version: 1.23.6+
    Description: Admission controller latency histogram in seconds, identified by name and broken out for each operation and API resource and type (validate or admit).
    Labels: name, operation, rejected, type

  • apiserver_admission_step_admission_duration_seconds (GA)
    Cloud Monitoring metric name: apiserver_admission_step_admission_duration_seconds/histogram
    Kind, Type, Unit: Cumulative, Distribution, s
    Monitored resources: prometheus_target
    Required GKE version: 1.23.6+
    Description: Admission sub-step latency histogram in seconds, broken out for each operation and API resource and step type (validate or admit).
    Labels: operation, rejected, type

  • apiserver_admission_webhook_admission_duration_seconds (GA)
    Cloud Monitoring metric name: apiserver_admission_webhook_admission_duration_seconds/histogram
    Kind, Type, Unit: Cumulative, Distribution, s
    Monitored resources: prometheus_target
    Required GKE version: 1.23.6+
    Description: Admission webhook latency histogram in seconds, identified by name and broken out for each operation and API resource and type (validate or admit).
    Labels: name, operation, rejected, type

Scheduler metrics

When scheduler metrics are enabled, all of the metrics listed below are exported to Cloud Monitoring in the same project as the GKE cluster.

The Cloud Monitoring metric names in this list must be prefixed with prometheus.googleapis.com/. That prefix has been omitted from the entries.

Each entry shows the PromQL metric name and its launch stage, followed by the Cloud Monitoring metric name; kind, type, and unit; monitored resources; required GKE version; description; and labels.

  • scheduler_pending_pods (GA)
    Cloud Monitoring metric name: scheduler_pending_pods/gauge
    Kind, Type, Unit: Gauge, Double, 1
    Monitored resources: prometheus_target
    Required GKE version: 1.23.6+
    Description: Number of pending pods, by the queue type. 'active' means number of pods in activeQ; 'backoff' means number of pods in backoffQ; 'unschedulable' means number of pods in unschedulablePods.
    Labels: queue

  • scheduler_preemption_attempts_total (GA)
    Cloud Monitoring metric name: scheduler_preemption_attempts_total/counter
    Kind, Type, Unit: Cumulative, Double, 1
    Monitored resources: prometheus_target
    Required GKE version: 1.23.6+
    Description: Total preemption attempts in the cluster till now.

  • scheduler_preemption_victims (GA)
    Cloud Monitoring metric name: scheduler_preemption_victims/histogram
    Kind, Type, Unit: Cumulative, Distribution, 1
    Monitored resources: prometheus_target
    Required GKE version: 1.23.6+
    Description: Number of selected preemption victims.

  • scheduler_scheduling_attempt_duration_seconds (GA)
    Cloud Monitoring metric name: scheduler_scheduling_attempt_duration_seconds/histogram
    Kind, Type, Unit: Cumulative, Distribution, 1
    Monitored resources: prometheus_target
    Required GKE version: 1.23.6+
    Description: Scheduling attempt latency in seconds (scheduling algorithm + binding).
    Labels: profile, result

  • scheduler_schedule_attempts_total (GA)
    Cloud Monitoring metric name: scheduler_schedule_attempts_total/counter
    Kind, Type, Unit: Cumulative, Double, 1
    Monitored resources: prometheus_target
    Required GKE version: 1.23.6+
    Description: Number of attempts to schedule pods, by the result. 'unschedulable' means a pod could not be scheduled, while 'error' means an internal scheduler problem.
    Labels: profile, result

Controller Manager metrics

When controller manager metrics are enabled, all of the metrics listed below are exported to Cloud Monitoring in the same project as the GKE cluster.

The Cloud Monitoring metric names in this list must be prefixed with prometheus.googleapis.com/. That prefix has been omitted from the entries.

Each entry shows the PromQL metric name and its launch stage, followed by the Cloud Monitoring metric name; kind, type, and unit; monitored resources; required GKE version; description; and labels.

  • node_collector_evictions_total (GA)
    Cloud Monitoring metric name: node_collector_evictions_total/counter
    Kind, Type, Unit: Cumulative, Double, 1
    Monitored resources: prometheus_target
    Required GKE version: 1.24+
    Description: Number of Node evictions that happened since current instance of NodeController started.
    Labels: zone

Workload metrics

If your GKE cluster is using workload metrics, migrate to Managed Service for Prometheus before upgrading your cluster to GKE 1.24, because workload metrics support is removed in GKE 1.24.

GKE versions from 1.20.8-gke.2100 up to, but not including, 1.24 offer a fully managed metric-collection pipeline that scrapes Prometheus-style metrics exposed by any GKE workload (such as a CronJob or a Deployment for an application) and sends those metrics to Cloud Monitoring.

All metrics collected by the GKE workload metrics pipeline are ingested into Cloud Monitoring with the prefix workload.googleapis.com.

Benefits of GKE workload metrics include:

  • Easy setup: With a single kubectl command to deploy a PodMonitor custom resource, you can start collecting metrics. No manual installation of an agent is required.
  • Highly configurable: Adjust scrape endpoints, frequency, and other parameters.
  • Fully managed: Google maintains the pipeline, so you can focus on your applications.
  • Control costs: Easily manage Cloud Monitoring costs through flexible metric filtering.
  • Open standard: Configure workload metrics using the PodMonitor custom resource, which is modeled after the Prometheus Operator's PodMonitor resource.
  • Better pricing: More intuitive, predictable, and lower pricing.
  • Autopilot support: Supports both GKE Standard and GKE Autopilot clusters.

Requirements

GKE workload metrics requires GKE control plane version 1.20.8-gke.2100 or later and does not support GKE Windows workloads.

If you enable the workload metrics pipeline, then you must also enable the collection of system metrics.

Step 1: Enable the workload metrics pipeline

To enable the workload metrics collection pipeline in an existing GKE cluster, follow these steps:

CONSOLE

  1. In the Google Cloud console, go to the list of GKE clusters:

    Go to Kubernetes clusters

  2. Click your cluster's name.

  3. In the row labeled "Cloud Monitoring", click the Edit icon.

  4. In the "Edit Cloud Monitoring" dialog box that appears, confirm that Enable Cloud Monitoring is selected.

  5. In the dropdown menu, select Workloads.

  6. Click OK.

  7. Click Save Changes.

GCLOUD

  1. Open a terminal window with the Google Cloud CLI installed. One way to do this is to use Cloud Shell.

  2. In the console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  3. Run this command:

    gcloud beta container clusters update [CLUSTER_ID] \
      --zone=[ZONE] \
      --project=[PROJECT_ID] \
      --monitoring=SYSTEM,WORKLOAD
    

    Including the WORKLOAD value for the --monitoring flag of the gcloud beta container clusters create or gcloud beta container clusters update commands enables the workload metrics collection pipeline.

Enabling the workload metrics collection pipeline deploys a metrics-collection agent on each node that is capable of collecting application metrics emitted by Kubernetes workloads.

See Configuring Cloud Operations for GKE for more details about Cloud Monitoring's integration with GKE.

Step 2: Configure which metrics are collected

1) Create a file named my-pod-monitor.yaml that defines a PodMonitor custom resource:


apiVersion: monitoring.gke.io/v1alpha1
kind: PodMonitor
metadata:
  # POD_MONITOR_NAME is how you identify your PodMonitor
  name: [POD_MONITOR_NAME]
  # POD_NAMESPACE is the namespace where your workload is running, and the
  # namespace of your PodMonitor object
  namespace: [POD_NAMESPACE]
spec:
  # POD_LABEL_KEY and POD_LABEL_VALUE identify which pods to collect metrics
  # from. For example, POD_LABEL_KEY of app.kubernetes.io/name and
  # POD_LABEL_VALUE of mysql would collect metrics from all pods with the label
  # app.kubernetes.io/name=mysql
  selector:
    matchLabels:
      [POD_LABEL_KEY]: [POD_LABEL_VALUE]
  podMetricsEndpoints:
    # CONTAINER_PORT_NAME is the name of the port of the container to be scraped
    # Use the following command to list all ports exposed by the container:
    # kubectl get pod [POD_NAME] -n [POD_NAMESPACE] -o json | jq '.spec.containers[].ports[]?' | grep -v null
    # If the port for your metrics does not have a name, modify your application's pod spec
    # to add a port name.
  - port: [CONTAINER_PORT_NAME]

2) Get credentials for your cluster using the Google Cloud CLI to set up kubectl:

gcloud container clusters get-credentials [CLUSTER_ID] --zone=[ZONE]

3) Deploy the PodMonitor custom resource:

kubectl apply -f my-pod-monitor.yaml

Pricing

Cloud Monitoring charges for the ingestion of GKE workload metrics based on the number of samples ingested. Learn more about Cloud Monitoring pricing.

Managing costs

To manage costs, start by determining which workload metrics are most valuable for your monitoring needs. The GKE workload metrics pipeline provides fine-grained controls to achieve the right trade-off between capturing detailed metrics and keeping costs low.

Many applications expose a wide variety of Prometheus metrics, and by default, the GKE workload metrics pipeline scrapes all metrics from each selected pod every 60 seconds.

You can use the following techniques to reduce the cost of metrics:

  1. Adjusting scrape frequency: To lower the cost of metrics, we recommend reducing the scrape frequency when appropriate. For example, a business-relevant KPI might change slowly enough that it can be scraped every ten minutes. In the PodMonitor, set the interval to control scraping frequency.

  2. Filtering metrics: Identify any metrics that are not being used in Cloud Monitoring and use metricRelabelings to ensure that only metrics useful for dashboards, alerts, or SLOs are sent to Cloud Monitoring.

Here's a concrete example of a PodMonitor custom resource using both techniques:

apiVersion: monitoring.gke.io/v1alpha1
kind: PodMonitor
metadata:
  name: prom-example
  namespace: gke-workload-metrics
spec:
  selector:
    matchLabels:
      app: example
  podMetricsEndpoints:
  - port: metrics
    path: /metrics
    scheme: http

    # (1) scrape metrics less frequently than the default (once every 60s)
    interval: 10m

    metricRelabelings:

    - # (2) drop the irrelevant metric named "foo" and all metrics
      # with the prefix "bar_"
      sourceLabels: [__name__]
      regex: foo|bar_.*
      action: drop

    - # (3) keep only metrics with a subset of values for a particular label
      sourceLabels: [region]
      regex: us-.*
      action: keep

To identify which metrics have the largest number of samples being ingested, use the monitoring.googleapis.com/collection/attribution/write_sample_count metric:

  1. In the console, select Monitoring:

    Go to Monitoring

  2. In the Monitoring navigation pane, click Metrics Explorer.

  3. In the Metric field, select monitoring.googleapis.com/collection/attribution/write_sample_count.

  4. Optionally, filter for only GKE workload metrics:

    • Click Add Filter.

    • In the Label field, select metric_domain.

    • In the Value field, enter workload.googleapis.com.

    • Click Done.

  5. Optionally, group the number of samples ingested by Kubernetes namespace, GKE region, Google Cloud project, or the monitored resource type:

    • Click Group by.

    • To group by Kubernetes namespace, select attribution_dimension and attribution_id.

    • To group by GKE region, select location.

    • To group by Cloud project, select resource_container.

    • To group by monitored resource type, select resource_type.

  6. Sort the list of metrics in descending order by clicking the column header Value above the list of metrics.

These steps show the metrics with the highest rate of samples ingested into Cloud Monitoring. Since GKE workload metrics are charged by the number of samples ingested, pay attention to metrics with the greatest rate of samples being ingested. Consider whether you can reduce the scrape frequency of any of these metrics or whether you can stop collecting any of them.

Finally, there are many resources available to understand the cost of GKE metrics ingested into Cloud Monitoring and to optimize those costs. See the cost optimization guide for additional ways to reduce the cost of Cloud Monitoring.

Earlier GKE versions

To collect application metrics in a cluster with a GKE control plane version earlier than 1.20.8-gke.2100, use the Stackdriver Prometheus sidecar.

Troubleshooting

If metrics are not available in Cloud Monitoring as expected, use the steps in the following sections to troubleshoot.

Troubleshooting system metrics

If system metrics are not available in Cloud Monitoring as expected, here are some steps you can take to troubleshoot the issue.

Confirm that the metrics agent has sufficient memory

In most cases, the default allocation of resources to the GKE metrics agent is sufficient. However, if the DaemonSet crashes repeatedly, you can check the termination reason with the following instructions:

  1. Get the names of the GKE metrics agent Pods:

    kubectl get pods -n kube-system -l component=gke-metrics-agent
    

    Find the Pod with the status CrashLoopBackOff.

    The output is similar to the following:

    NAME                    READY STATUS           RESTARTS AGE
    gke-metrics-agent-5857x 0/1   CrashLoopBackOff 6        12m
    
  2. Describe the Pod that has the status CrashLoopBackOff:

    kubectl describe pod POD_NAME -n kube-system
    

    Replace POD_NAME with the name of the Pod from the previous step.

    If the termination reason of the Pod is OOMKilled, the agent needs additional memory.

    The output is similar to the following:

      containerStatuses:
      ...
      lastState:
        terminated:
          ...
          exitCode: 1
          finishedAt: "2021-11-22T23:36:32Z"
          reason: OOMKilled
          startedAt: "2021-11-22T23:35:54Z"
    
  3. Add a temporary node label to the node with the failing metrics agent. This label will not persist after an upgrade.

    kubectl label node/NODE_NAME \
      ADDITIONAL_MEMORY_NODE_LABEL --overwrite
    

    Replace ADDITIONAL_MEMORY_NODE_LABEL with one of the following:

    • To add an additional 10 MB: cloud.google.com/gke-metrics-agent-scaling-level=10
    • To add an additional 20 MB: cloud.google.com/gke-metrics-agent-scaling-level=20

    Replace NODE_NAME with the name of the node of the affected metrics agent.

    Alternatively, you can create a new node pool with a persistent node label and use node taints to migrate your workloads to the new node pool.

    To create a node pool with a persistent label, run the following command:

    gcloud container node-pools create NODEPOOL_NAME \
     --cluster=CLUSTER_NAME  \
     --node-labels=ADDITIONAL_MEMORY_NODE_LABEL
    

    Replace the following:

    • CLUSTER_NAME: the name of the existing cluster.
    • NODEPOOL_NAME: the name of the new node pool.
    • ADDITIONAL_MEMORY_NODE_LABEL: one of the additional memory node labels from the previous step, adding 10 MB or 20 MB of additional memory.

Troubleshooting workload metrics

If workload metrics are not available in Cloud Monitoring as expected, here are some steps you can take to troubleshoot the issue.

Confirm your cluster meets minimum requirements

Ensure that your GKE cluster is running control plane version 1.20.8-gke.2100 or later. If not, upgrade your cluster's control plane.
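
For example, a sketch that upgrades the control plane to the default version available for the cluster (the cluster ID and zone are placeholders):

gcloud container clusters upgrade [CLUSTER_ID] \
  --zone=[ZONE] \
  --master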

Confirm workload metrics is enabled

Ensure that your GKE cluster is configured to enable the workload metrics collection pipeline by following these steps:

CONSOLE

  1. In the Google Cloud console, go to the list of GKE clusters:

    Go to Kubernetes clusters

  2. Click your cluster's name.

  3. In the Details panel for your cluster, confirm that "Workload" is included in the status for Cloud Monitoring. If "Workload" is not shown here, then enable workload metrics.

GCLOUD

  1. Open a terminal window with the Google Cloud CLI installed. One way to do this is to use Cloud Shell.

  2. In the console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  3. Run this command:

    gcloud container clusters describe [CLUSTER_ID] --zone=[ZONE]
    

    In the output of that command, look for the monitoringConfig: line. A few lines later, confirm that the enableComponents: section includes WORKLOADS. If WORKLOADS is not shown, then enable workload metrics.
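
    To narrow the output, here is a sketch that assumes the field path monitoringConfig.componentConfig.enableComponents in the describe output:

    gcloud container clusters describe [CLUSTER_ID] \
      --zone=[ZONE] \
      --format="value(monitoringConfig.componentConfig.enableComponents)"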

Confirm metrics are collected from your application

If you are not able to view your application's metrics in Cloud Monitoring, the steps below can help you troubleshoot. These steps use a sample application for demonstration purposes, but you can apply the same steps to troubleshoot any application.

The first few steps deploy the sample application, which you can use for testing. The remaining steps show how to troubleshoot why metrics from that application might not be appearing in Cloud Monitoring.

  1. Create a file named prometheus-example-app.yaml containing the following:

    # This example application exposes prometheus metrics on port 1234.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: prom-example
      name: prom-example
      namespace: gke-workload-metrics
    spec:
      selector:
        matchLabels:
          app: prom-example
      template:
        metadata:
          labels:
            app: prom-example
        spec:
          containers:
          - image: nilebox/prometheus-example-app@sha256:dab60d038c5d6915af5bcbe5f0279a22b95a8c8be254153e22d7cd81b21b84c5
            name: prom-example
            ports:
            - name: metrics-port
              containerPort: 1234
            command:
            - "/main"
            - "--process-metrics"
            - "--go-metrics"
    
  2. Get credentials for your cluster using the Google Cloud CLI to set up kubectl:

    gcloud container clusters get-credentials [CLUSTER_ID] --zone=[ZONE]
    
  3. Create the gke-workload-metrics namespace:

    kubectl create ns gke-workload-metrics
    
  4. Deploy the example application:

    kubectl apply -f prometheus-example-app.yaml
    
  5. Confirm the example application is running by running this command:

    kubectl get pod -n gke-workload-metrics -l app=prom-example
    

    The output should look similar to this:

    NAME                            READY   STATUS    RESTARTS   AGE
    prom-example-775d8685f4-ljlkd   1/1     Running   0          25m
    
  6. Confirm your application is exposing metrics as expected

    Using one of the pods returned from the above command, check that the metrics endpoint is working correctly:

    POD_NAME=prom-example-775d8685f4-ljlkd
    NAMESPACE=gke-workload-metrics
    PORT_NUMBER=1234
    METRICS_PATH=/metrics
    kubectl get --raw /api/v1/namespaces/$NAMESPACE/pods/$POD_NAME:$PORT_NUMBER/proxy$METRICS_PATH
    

    If using the example application above, you should receive output similar to this:

    # HELP example_random_numbers A histogram of normally distributed random numbers.
    # TYPE example_random_numbers histogram
    example_random_numbers_bucket{le="0"} 501
    example_random_numbers_bucket{le="0.1"} 1.75933011554e+11
    example_random_numbers_bucket{le="0.2"} 3.50117676362e+11
    example_random_numbers_bucket{le="0.30000000000000004"} 5.20855682325e+11
    example_random_numbers_bucket{le="0.4"} 6.86550977647e+11
    example_random_numbers_bucket{le="0.5"} 8.45755380226e+11
    example_random_numbers_bucket{le="0.6"} 9.97201199544e+11
    ...
    
  7. Create a file named my-pod-monitor.yaml containing the following:

    # Note that this PodMonitor is in the monitoring.gke.io domain,
    # rather than the monitoring.coreos.com domain used with the
    # Prometheus Operator.  The PodMonitor supports a subset of the
    # fields in the Prometheus Operator's PodMonitor.
    apiVersion: monitoring.gke.io/v1alpha1
    kind: PodMonitor
    metadata:
      name: example
      namespace: gke-workload-metrics
    # spec describes how to monitor a set of pods in a cluster.
    spec:
      # selector determines which pods are monitored.  Required
      # This example matches pods with the `app: prom-example` label
      selector:
        matchLabels:
          app: prom-example
      podMetricsEndpoints:
        # port is the name of the port of the container to be scraped.
      - port: metrics-port
        # path is the path of the endpoint to be scraped.
        # Default /metrics
        path: /metrics
        # scheme is the scheme of the endpoint to be scraped.
        # Default http
        scheme: http
        # interval is the time interval at which metrics should
        # be scraped. Default 60s
        interval: 20s
    
  8. Create this PodMonitor resource:

    kubectl apply -f my-pod-monitor.yaml
    

    Once you have created the PodMonitor resource, the GKE workload metrics pipeline detects the matching pods and automatically starts scraping them periodically. The pipeline sends the collected metrics to Cloud Monitoring.

  9. Confirm the label and namespace are set correctly in your PodMonitor. Update the values of NAMESPACE and SELECTOR below to reflect the namespace and matchLabels in your PodMonitor custom resource. Then run this command:

    NAMESPACE=gke-workload-metrics
    SELECTOR=app=prom-example
    kubectl get pods --namespace $NAMESPACE --selector $SELECTOR
    

    You should see a result like this:

    NAME                            READY   STATUS    RESTARTS   AGE
    prom-example-7cff4db5fc-wp8lw   1/1     Running   0          39m
    
  10. Confirm the PodMonitor is in the Ready state.

    Run this command to return all of the PodMonitors you have installed in your cluster:

    kubectl get podmonitor.monitoring.gke.io --all-namespaces
    

    You should see output similar to this:

    NAMESPACE              NAME      AGE
    gke-workload-metrics   example   2m36s
    

    Identify the relevant PodMonitor from the set returned and run this command (replacing example in the command below with the name of your PodMonitor):

    kubectl describe podmonitor.monitoring.gke.io example --namespace gke-workload-metrics
    

    Examine the results returned by kubectl describe and confirm that the "Ready" condition is True. If Ready is False, look for events that indicate why it isn't Ready.

  11. Next, confirm that these metrics are received by Cloud Monitoring. In the Cloud Monitoring section of the Google Cloud console, go to Metrics explorer.

  12. In the Metric field, type example_requests_total.

  13. In the dropdown menu that appears, select workload.googleapis.com/example_requests_total.

    This example_requests_total metric is one of the Prometheus metrics emitted by the example application.

    If the dropdown menu doesn't appear or if you don't see workload.googleapis.com/example_requests_total in the dropdown menu, try again in a few minutes.

    All metrics are associated with the Kubernetes Container (k8s_container) resource they are collected from. You can use the Resource type field of Metrics Explorer to select k8s_container. You can also group by any labels such as namespace_name or pod_name.

    This metric can be used anywhere within Cloud Monitoring or queried via the Cloud Monitoring API. For example, to add this chart to an existing or new dashboard, click on the Save Chart button in the top right corner and select the desired dashboard in a dialog window.

Check for errors sending to the Cloud Monitoring API

Metrics sent to Cloud Monitoring need to stay within the custom metric limits. Errors will show up in the Cloud Monitoring audit logs.

Using the Cloud Logging Logs Explorer, look in the logs with this log filter (replace PROJECT_ID with your project ID):

resource.type="audited_resource"
resource.labels.service="monitoring.googleapis.com"
protoPayload.authenticationInfo.principalEmail=~".*-compute@developer.gserviceaccount.com"
resource.labels.project_id="[PROJECT_ID]"
severity>=ERROR

Note that this will show errors for all writes to Cloud Monitoring for the project, not just those from your cluster.
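
If you prefer the command line, the following sketch runs the same query with gcloud logging read (the limit value is arbitrary):

gcloud logging read '
  resource.type="audited_resource" AND
  resource.labels.service="monitoring.googleapis.com" AND
  protoPayload.authenticationInfo.principalEmail=~".*-compute@developer.gserviceaccount.com" AND
  resource.labels.project_id="[PROJECT_ID]" AND
  severity>=ERROR
' --project=[PROJECT_ID] --limit=10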

Check your Cloud Monitoring ingestion Quota

  1. Go to the Cloud Monitoring API Quotas page
  2. Select the relevant project
  3. Expand the Time series ingestion requests section
  4. Confirm that the Quota exceeded errors count for "Time series ingestion requests per minute" is 0. (If this is the case, the "Quota exceeded errors count" graph will indicate "No data is available for the selected time frame.")
  5. If the peak usage percentage exceeds 100%, then consider selecting fewer pods, selecting fewer metrics, or requesting a higher quota limit for the monitoring.googleapis.com/ingestion_requests quota.

Confirm the DaemonSet is deployed

Ensure that the workload metrics DaemonSet is deployed in your cluster. You can validate this is working as expected by using kubectl:

  1. Initialize the credential using Google Cloud CLI in order to set up kubectl:

    gcloud container clusters get-credentials [CLUSTER_ID] --zone=[ZONE]
    
  2. Check that the workload-metrics DaemonSet is present and healthy. Run this:

    kubectl get ds -n kube-system workload-metrics
    

    If components were deployed successfully, you will see something similar to the following output:

    NAME               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   AGE
    workload-metrics   3         3         3       3            3           2m
    

    The number of replicas above should match the number of Linux GKE nodes in your cluster. For example, a cluster with 3 nodes will have DESIRED = 3. After a few minutes, READY and AVAILABLE numbers should match the DESIRED number. Otherwise, there might be an issue with deployment.

Other metrics

In addition to the system metrics and control plane metrics in this document, Istio metrics are also available for GKE clusters.