Google Kubernetes Engine (GKE) makes it easy to send metrics to Cloud Monitoring. Once in Cloud Monitoring, metrics can populate custom dashboards, generate alerts, create service-level objectives, or be fetched by third-party monitoring services using the Cloud Monitoring API.
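For example, once metrics are in Cloud Monitoring they can be read programmatically with a timeSeries.list call to the Cloud Monitoring API. The sketch below assumes the gcloud CLI is installed and authenticated; the project ID, metric type, and time interval are placeholders:

```shell
# Read one hour of a GKE system metric via the Cloud Monitoring API.
# PROJECT_ID is a placeholder; kubernetes.io/container/cpu/core_usage_time
# is one example of a GKE system metric.
PROJECT_ID=my-project
curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries" \
  --data-urlencode 'filter=metric.type="kubernetes.io/container/cpu/core_usage_time"' \
  --data-urlencode 'interval.startTime=2024-01-01T00:00:00Z' \
  --data-urlencode 'interval.endTime=2024-01-01T01:00:00Z'
```

The response is a JSON list of time series, one per monitored resource that emitted the metric in the interval.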
GKE provides several sources of metrics:
- System metrics: metrics from essential system components, describing low-level resources such as CPU, memory, and storage.
- Managed Service for Prometheus: lets you monitor and alert on your workloads, using Prometheus, without having to manually manage and operate Prometheus at scale.
- Control plane metrics: metrics exported from certain control plane components such as the API server and scheduler.
- Workload metrics (Deprecated): metrics exposed by any GKE workload (such as a CronJob or a Deployment for an application).
System metrics
When a cluster is created, GKE by default collects certain metrics emitted by system components.
You can choose whether to send metrics from your GKE cluster to Cloud Monitoring. If you choose to send metrics to Cloud Monitoring, you must send system metrics.
All GKE system metrics are ingested into Cloud Monitoring with the prefix kubernetes.io.
Pricing
Cloud Monitoring does not charge for the ingestion of GKE system metrics. For more information, see Cloud Monitoring pricing.
Configuring collection of system metrics
To enable system metric collection, pass the SYSTEM value to the --monitoring flag of the gcloud container clusters create or gcloud container clusters update commands.
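For example, with placeholder cluster and zone names, the following commands enable system metric collection on a new cluster and on an existing one:

```shell
# my-cluster and us-central1-a are placeholders; adjust for your environment.
# Enable system metrics when creating a cluster:
gcloud container clusters create my-cluster \
  --zone=us-central1-a \
  --monitoring=SYSTEM

# Enable system metrics on an existing cluster:
gcloud container clusters update my-cluster \
  --zone=us-central1-a \
  --monitoring=SYSTEM
```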
To disable system metric collection, use the NONE value for the --monitoring flag. If system metric collection is disabled, basic information such as CPU usage, memory usage, and disk usage is not available for a cluster in the Observability tab or the GKE section of the Google Cloud console. Additionally, the Cloud Monitoring GKE Dashboard does not contain information about the cluster.
See Configuring Cloud Operations for GKE for more details about Cloud Monitoring integration with GKE.
To configure the collection of system metrics by using Terraform, see the monitoring_config block in the Terraform registry for google_container_cluster.
For general information about using Google Cloud with Terraform, see
Terraform with Google Cloud.
List of system metrics
System metrics include metrics from essential system components important for core Kubernetes functionality. For a list of these metrics, see GKE system metrics.
Troubleshooting system metrics
If system metrics are not available in Cloud Monitoring as expected, here are some steps you can take to troubleshoot the issue.
Confirm that the metrics agent has sufficient memory
In most cases, the default allocation of resources to the GKE metrics agent is sufficient. However, if the DaemonSet crashes repeatedly, you can check the termination reason with the following instructions:
Get the names of the GKE metrics agent Pods:
kubectl get pods -n kube-system -l component=gke-metrics-agent
Find the Pod with the status CrashLoopBackOff. The output is similar to the following:
NAME                      READY   STATUS             RESTARTS   AGE
gke-metrics-agent-5857x   0/1     CrashLoopBackOff   6          12m
Describe the Pod that has the status CrashLoopBackOff:
kubectl describe pod POD_NAME -n kube-system
Replace POD_NAME with the name of the Pod from the previous step.
If the termination reason of the Pod is OOMKilled, the agent needs additional memory. The output is similar to the following:
containerStatuses:
...
  lastState:
    terminated:
      ...
      exitCode: 1
      finishedAt: "2021-11-22T23:36:32Z"
      reason: OOMKilled
      startedAt: "2021-11-22T23:35:54Z"
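As a quicker check, the termination reason can also be read directly with kubectl's JSONPath output. The Pod name below is a placeholder taken from the earlier example output:

```shell
# Print the last termination reason of the metrics agent container.
# gke-metrics-agent-5857x is a placeholder Pod name; substitute your own.
kubectl get pod gke-metrics-agent-5857x -n kube-system \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```

If the command prints OOMKilled, proceed with adding a memory-scaling node label as described below.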
Add a node label to the node with the failing metrics agent. You can use either a persistent or temporary node label. We recommend you try adding an additional 20 MB. If the agent keeps crashing, you can run this command again, replacing the node label with one requesting a higher amount of additional memory.
To update a node pool with a persistent label, run the following command:
gcloud container node-pools update NODEPOOL_NAME \
  --cluster=CLUSTER_NAME \
  --node-labels=ADDITIONAL_MEMORY_NODE_LABEL \
  --zone=ZONE
Replace the following:
- NODEPOOL_NAME: the name of the node pool.
- CLUSTER_NAME: the name of the existing cluster.
- ADDITIONAL_MEMORY_NODE_LABEL: one of the additional memory node labels; use one of the following:
  - To add 10 MB: cloud.google.com/gke-metrics-agent-scaling-level=10
  - To add 20 MB: cloud.google.com/gke-metrics-agent-scaling-level=20
  - To add 50 MB: cloud.google.com/gke-metrics-agent-scaling-level=50
  - To add 100 MB: cloud.google.com/gke-metrics-agent-scaling-level=100
  - To add 200 MB: cloud.google.com/gke-metrics-agent-scaling-level=200
  - To add 500 MB: cloud.google.com/gke-metrics-agent-scaling-level=500
- ZONE: the zone in which the cluster is running.
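Filled in with placeholder names, a persistent label requesting an additional 20 MB might look like this:

```shell
# default-pool, my-cluster, and us-central1-a are placeholders;
# the label value 20 requests an additional 20 MB for the metrics agent.
gcloud container node-pools update default-pool \
  --cluster=my-cluster \
  --node-labels=cloud.google.com/gke-metrics-agent-scaling-level=20 \
  --zone=us-central1-a
```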
Alternatively, you can add a temporary node label that will not persist after an upgrade by using the following command:
kubectl label node/NODE_NAME \
  ADDITIONAL_MEMORY_NODE_LABEL --overwrite
Replace the following:
- NODE_NAME: the name of the node running the affected metrics agent.
- ADDITIONAL_MEMORY_NODE_LABEL: one of the additional memory node labels; use one of the values from the preceding example.
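For example, with a placeholder node name, a temporary label requesting an additional 20 MB looks like this:

```shell
# The node name is a placeholder; labels applied with kubectl do not
# persist across node upgrades or recreation.
kubectl label node/gke-my-cluster-default-pool-1a2b3c4d-xyz1 \
  cloud.google.com/gke-metrics-agent-scaling-level=20 --overwrite
```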
Control plane metrics
You can configure a GKE cluster to send certain metrics emitted by the Kubernetes API server, Scheduler, and Controller Manager to Cloud Monitoring.
Requirements
Sending metrics emitted by Kubernetes control plane components to Cloud Monitoring requires GKE control plane version 1.22.13 or later and requires that the collection of system metrics be enabled.
Configuring collection of control plane metrics
To enable Kubernetes control plane metrics in an existing GKE cluster, follow these steps:
Console
You can enable control plane metrics for a cluster either from the Observability tab for the cluster or from Details tab for the cluster. When you use the Observability tab, you can preview the available charts and metrics before you enable the metric package.
To enable control plane metrics from the Observability tab for the cluster, do the following:
In the Google Cloud console, go to the list of GKE clusters:
Click your cluster's name and then select the Observability tab.
Select Control Plane from the list of features.
Click Enable package.
If the control plane metrics are already enabled, then you see a set of charts for control plane metrics instead.
To enable control plane metrics from the Details tab for the cluster, do the following:
In the Google Cloud console, go to the list of GKE clusters:
Click your cluster's name.
In the Features row labeled Cloud Monitoring, click the Edit icon.
In the Edit Cloud Monitoring dialog box that appears, confirm that Enable Cloud Monitoring is selected.
In the Components dropdown menu, select the control plane components from which you would like to collect metrics: API Server, Scheduler, or Controller Manager.
Click OK.
Click Save Changes.
gcloud
Open a terminal window with Google Cloud SDK and the Google Cloud CLI installed. One way to do this is to use Cloud Shell.
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
Pass one or more of the values API_SERVER, SCHEDULER, or CONTROLLER_MANAGER to the --monitoring flag of the gcloud container clusters create or gcloud container clusters update commands.
For example, to collect metrics from the API server, scheduler, and controller manager, run this command:
gcloud container clusters update [CLUSTER_ID] \
  --zone=[ZONE] \
  --project=[PROJECT_ID] \
  --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER
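To confirm which components are enabled after the update, you can inspect the cluster's monitoring configuration. The field path below reflects the GKE API's monitoringConfig structure and is shown as an assumption; cluster name and zone are placeholders:

```shell
# Prints the list of enabled monitoring components for the cluster,
# e.g. SYSTEM_COMPONENTS;APISERVER;SCHEDULER;CONTROLLER_MANAGER.
gcloud container clusters describe my-cluster \
  --zone=us-central1-a \
  --format="value(monitoringConfig.componentConfig.enableComponents)"
```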
Terraform
To configure the collection of Kubernetes control plane metrics by using Terraform, see the monitoring_config block in the Terraform registry for google_container_cluster.
For general information about using Google Cloud with Terraform, see Terraform with Google Cloud.
Using control plane metrics
See Use control plane metrics for the following:
Lists of control plane metrics for API server, scheduler, and controller manager.
Guidance and best practices for using the API server metrics and the scheduler metrics.
Dashboards for visualizing control plane metrics available on the GKE Observability tab in the Google Cloud console. For information about these dashboards, see View observability metrics.
Pricing
GKE control plane metrics use Google Cloud Managed Service for Prometheus to ingest metrics into Cloud Monitoring. Cloud Monitoring charges for the ingestion of GKE control plane metrics based on the number of samples ingested. For more information, see Cloud Monitoring pricing.
Understanding your Monitoring bill
To identify which control plane metrics have the largest number of samples being ingested, use the monitoring.googleapis.com/collection/attribution/write_sample_count metric:
In the Google Cloud console, select Monitoring:
In the Monitoring navigation pane, click Metrics Explorer.
In the Metric field, select monitoring.googleapis.com/collection/attribution/write_sample_count.
Click Add Filter.
In the Label field, select attribution_dimension.
In the Comparison field, select = (equals).
In the Value field, enter cluster.
Click Done.
Optionally, filter for only certain metrics. In particular, because API server metric names all contain "apiserver" and Scheduler metric names all contain "scheduler", you can restrict the results to metrics containing those strings:
Click Add Filter.
In the Label field, select metric_type.
In the Comparison field, select =~ (equals regex).
In the Value field, enter .*apiserver.* or .*scheduler.*.
Click Done.
Optionally, group the number of samples ingested by GKE region or project:
Click Group by.
Ensure metric_type is selected.
To group by GKE region, select location.
To group by project, select project_id.
Click OK.
Optionally, group the number of samples ingested by GKE cluster name:
Click Group by.
To group by GKE cluster name, ensure both attribution_dimension and attribution_id are selected.
Click OK.
Sort the list of metrics in descending order by clicking the column header Value above the list of metrics.
These steps show the metrics with the highest rate of samples ingested into Cloud Monitoring. Since GKE control plane metrics are charged by the number of samples ingested, pay attention to metrics with the greatest rate of samples being ingested.
Quota
Control plane metrics consume the "Time series ingestion requests per minute" quota of the Cloud Monitoring API. Before enabling control plane metrics, check your recent peak usage of that quota. If you have many clusters in the same project or are already approaching that quota's limit, you can request a quota-limit increase before enabling control plane metrics.
Other metrics
In addition to the system metrics and control plane metrics in this document, Istio metrics are also available for GKE clusters.