Troubleshooting Managed Service for Prometheus

This document describes some problems you might encounter when using Google Cloud Managed Service for Prometheus and provides information on diagnosing and resolving the problems.

You configured Managed Service for Prometheus but are not seeing any metric data in Grafana or the Prometheus UI. At a high level, the cause might be either of the following:

  • A problem on the query side, so that data can't be read. Query-side problems are often caused by incorrect permissions on the service account reading the data or by misconfiguration of Grafana.

  • A problem on the ingestion side, so that no data is sent. Ingestion-side problems can be caused by configuration problems with service accounts, collectors, or rule evaluation.

To determine whether the problem is on the ingestion side or the query side, try querying data by using the Metrics Explorer PromQL tab in the Google Cloud console. This page is guaranteed not to have any issues with read permissions or Grafana settings.

To view this page, do the following:

  1. Use the Google Cloud console project picker to select the project for which you are not seeing data.

  2. In the Google Cloud console, go to the Metrics explorer page:

    Go to Metrics explorer

    If you use the search bar to find this page, then select the result whose subheading is Monitoring.

  3. In the toolbar of the query-builder pane, select the button whose name is either MQL or PromQL.

  4. Verify that PromQL is selected in the Language toggle. The language toggle is in the same toolbar that lets you format your query.

  5. Enter the following query into the editor, and then click Run query:

    up
    

If you query the up metric and see results, then the problem is on the query side. For information on resolving these problems, see Query-side problems.

If you query the up metric and do not see any results, then the problem is on the ingestion side. For information on resolving these problems, see Ingestion-side problems.

A firewall can also cause ingestion and query problems; for more information, see Firewalls.

The Cloud Monitoring Metrics Management page provides information that can help you control the amount you spend on billable metrics without affecting observability. The Metrics Management page reports the following information:

  • Ingestion volumes for both byte- and sample-based billing, across metric domains and for individual metrics.
  • Data about labels and cardinality of metrics.
  • Number of reads for each metric.
  • Use of metrics in alerting policies and custom dashboards.
  • Rate of metric-write errors.

You can also use the Metrics Management page to exclude unneeded metrics, eliminating the cost of ingesting them.

To view the Metrics Management page, do the following:

  1. In the Google Cloud console, go to the Metrics management page:

    Go to Metrics management

    If you use the search bar to find this page, then select the result whose subheading is Monitoring.

  2. In the toolbar, select your time window. By default, the Metrics Management page displays information about the metrics collected in the previous one day.

For more information about the Metrics Management page, see View and manage metric usage.

Query-side problems

The cause of most query-side problems is one of the following:

Start by doing the following:

  • Check your configuration carefully against the setup instructions for querying.

  • If you are using Workload Identity Federation for GKE, verify that your service account has the correct permissions by doing the following:

    1. In the Google Cloud console, go to the IAM page:

      Go to IAM

      If you use the search bar to find this page, then select the result whose subheading is IAM & Admin.

    2. Identify the service account name in the list of principals. Verify that the name of the service account is correctly spelled. Then click Edit.

    3. Select the Role field, then click Currently used and search for the Monitoring Viewer role. If the service account doesn't have this role, add it now.

If the problem still persists, then consider the following possibilities:

Misconfigured or mistyped secrets

If you see any of the following, then you might have a missing or mistyped secret:

  • One of these "forbidden" errors in Grafana or the Prometheus UI:

    • "Warning: Unexpected response status when fetching server time: Forbidden"
    • "Warning: Error fetching metrics list: Unexpected response status when fetching metric names: Forbidden"
  • A message like this in your logs:
    "cannot read credentials file: open /gmp/key.json: no such file or directory"

If you are using the data source syncer to authenticate and configure Grafana, try the following to resolve these errors:

  1. Verify that you have chosen the correct Grafana API endpoint, Grafana data source UID, and Grafana API token. You can inspect the variables in the CronJob by running the command kubectl describe cronjob datasource-syncer.

  2. Verify that you have set the data source syncer's project ID to the same metrics scope or project that your service account has credentials for.

  3. Verify that your Grafana service account has the "Admin" role and that your API token has not expired.

  4. Verify that your service account has the Monitoring Viewer role for the chosen project ID.

  5. Verify that there are no errors in the logs for the data source syncer Job by running kubectl logs job.batch/datasource-syncer-init. This command has to be run immediately after applying the datasource-syncer.yaml file.

  6. If using Workload Identity Federation for GKE, verify that you have not mistyped the account key or credentials, and verify that you have bound it to the correct namespace.

If you are using the legacy frontend UI proxy, try the following to resolve these errors:

  1. Verify that you have set the frontend UI's project ID to the same metrics scope or project that your service account has credentials for.

  2. Verify the project ID you've specified for any --query.project-id flags.

  3. Verify that your service account has the Monitoring Viewer role for the chosen project ID.

  4. Verify you have set the correct project ID when deploying the frontend UI and did not leave it set to the literal string PROJECT_ID.

  5. If using Workload Identity Federation for GKE, verify that you have not mistyped the account key or credentials, and verify that you have bound it to the correct namespace.

  6. If mounting your own secret, make sure the secret is present:

    kubectl get secret gmp-test-sa -o json | jq '.data | keys'
    
  7. Verify that the secret is correctly mounted:

    kubectl get deploy frontend -o json | jq .spec.template.spec.volumes
    
    kubectl get deploy frontend -o json | jq .spec.template.spec.containers[].volumeMounts
    
  8. Make sure the secret is passed correctly to the container:

    kubectl get deploy frontend -o json | jq .spec.template.spec.containers[].args
    

Incorrect HTTP method for Grafana

If you see the following API error from Grafana, then Grafana is configured to send a POST request instead of a GET request:

  • "{"status":"error","errorType":"bad_data","error":"no match[] parameter provided"}%"

To resolve this issue, configure Grafana to use a GET request by following the instructions in Configure a data source.

Timeouts on large or long-running queries

If you see the following error in Grafana, then your default query timeout is too low:

  • "Post "http://frontend.NAMESPACE_NAME.svc:9090/api/v1/query_range": net/http: timeout awaiting response headers"

Managed Service for Prometheus does not time out until a query exceeds 120 seconds, while Grafana times out after 30 seconds by default. To fix this, raise the timeouts in Grafana to 120 seconds by following the instructions in Configure a data source.

Label-validation errors

If you see one of the following errors in Grafana, then you might be using an unsupported endpoint:

  • "Validation: labels other than name are not supported yet"
  • "Templating [job]: Error updating options: labels other than name are not supported yet."

Managed Service for Prometheus supports the /api/v1/$label/values endpoint only for the __name__ label. This limitation causes queries using the label_values($label) variable in Grafana to fail.

Instead, use the label_values($metric, $label) form. This query is recommended because it constrains the returned label values by metric, which prevents retrieval of values not related to the dashboard's contents. This query calls a supported endpoint for Prometheus.

For more information about supported endpoints, see API compatibility.

Quota exceeded

If you see the following error, then you have exceeded your read quota for the Cloud Monitoring API:

  • "429: RESOURCE_EXHAUSTED: Quota exceeded for quota metric 'Time series queries ' and limit 'Time series queries per minute' of service 'monitoring.googleapis.com' for consumer 'project_number:...'."

To resolve this issue, submit a request to increase your read quota for the Monitoring API. For assistance, contact Google Cloud Support. For more information about quotas, see Working with quotas.

Metrics from multiple projects

If you want to view metrics from multiple Google Cloud projects, you don't have to configure multiple data source syncers or create multiple data sources in Grafana.

Instead, create a Cloud Monitoring metrics scope in one Google Cloud project — the scoping project — that contains the projects you want to monitor. When you configure the Grafana data source with a scoping project, you get access to the data from all projects in the metrics scope. For more information, see Queries and metrics scopes.

No monitored resource type specified

If you see the following error, then you need to specify a monitored resource type when using PromQL to query a Google Cloud system metric:

  • "metric is configured to be used with more than one monitored resource type; series selector must specify a label matcher on monitored resource name"

You can specify a monitored resource type by filtering using the monitored_resource label. For more information about identifying and choosing a valid monitored resource type, see Specifying a monitored resource type.

Counter, histogram, and summary raw values not matching between the collector UI and the Google Cloud console

You might notice a difference between the values in the local collector Prometheus UI and the Google Cloud console when querying the raw value of cumulative Prometheus metrics, including counters, histograms, and summaries. This behavior is expected.

Monarch requires start timestamps, but Prometheus doesn't have start timestamps. Managed Service for Prometheus generates start timestamps by skipping the first ingested point in any time series and converting it into a start timestamp. Subsequent points have the value of the initial skipped point subtracted from their value to ensure rates are correct. This causes a persistent deficit in the raw value of those points.

The difference between the number in the collector UI and the number in the Google Cloud console is equal to the first value recorded in the collector UI, which is expected because the system skips that initial value, and subtracts it from subsequent points.

This is acceptable because there's no production need for running a query for raw values for cumulative metrics; all useful queries require a rate() function or the like, in which case the difference over any time horizon is identical between the two UIs. Cumulative metrics only ever increase, so you can't set an alert on a raw query as a time series only ever hits a threshold one time. All useful alerts and charts look at the change or the rate of change in the value.
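The effect can be illustrated with a small simulation. This is a minimal sketch that assumes the behavior is exactly as described above (first point dropped, its value subtracted from later points); the function names and sample values are hypothetical and are not part of any Managed Service for Prometheus API.

```python
def to_managed_points(collector_points):
    """Mimic the start-timestamp handling described above.

    The first point is dropped (it becomes the start timestamp), and its
    value is subtracted from every later point.
    """
    if not collector_points:
        return []
    _, first_value = collector_points[0]
    return [(ts, value - first_value) for ts, value in collector_points[1:]]


def increase(points):
    """Change in value between the first and last point of a window."""
    return points[-1][1] - points[0][1]


# Hypothetical (timestamp_seconds, value) samples as seen in the collector UI.
collector = [(0, 100), (30, 130), (60, 175), (90, 190)]
managed = to_managed_points(collector)

# Raw values differ by a constant: the first collector value (100).
deficits = {c[1] - m[1] for c, m in zip(collector[1:], managed)}

# The increase over the same window is the same in both views.
same_increase = increase(collector[1:]) == increase(managed)
```

In this sketch the raw values always differ by the first recorded value, while the increase over a shared window matches, which is why rate-style queries agree between the two UIs.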

The collector only holds about 10 minutes of data locally. Discrepancies in raw cumulative values might also arise due to a reset happening before the 10-minute horizon. To rule out this possibility, try setting only a 10-minute query lookback period when comparing the collector UI to the Google Cloud console.

Discrepancies can also be caused by having multiple worker threads in your application, each with a /metrics endpoint. If your application spins up multiple threads, you have to put the Prometheus client library in multiprocess mode. For more information, see the documentation for using multiprocess mode in Prometheus' Python client library.

Missing counter data or broken histograms

The most common signal of this problem is seeing no data or seeing data gaps when querying a plain counter metric (for example, a PromQL query of metric_name_foo). You can confirm this if data appears after you add a rate function to your query (for example, rate(metric_name_foo[5m])).

You might also notice that the number of samples ingested has risen sharply without any major change in scrape volume, or that new metrics are being created with "unknown" or "unknown:counter" suffixes in Cloud Monitoring.

You might also notice that histogram operations, such as the histogram_quantile() function, don't work as expected.

These issues occur when a metric is collected without a Prometheus metric TYPE. As Monarch is strongly typed, Managed Service for Prometheus accounts for untyped metrics by suffixing them with "unknown" and ingesting them twice, once as a gauge and once as a counter. The query engine then chooses whether to query the underlying gauge or counter metric based on what query functions you use.

While this heuristic usually works quite well, it can lead to issues such as strange results when querying a raw "unknown:counter" metric. Also, as histograms are specifically typed objects in Monarch, ingesting the three required histogram metrics as individual counter metrics causes histogram functions to not work. As "unknown"-typed metrics are ingested twice, not setting a TYPE doubles the number of samples ingested.

Common reasons why TYPE might not be set include:

  • Accidentally configuring a Managed Service for Prometheus collector as a federation server. Federation is not supported when using Managed Service for Prometheus. As federation intentionally drops TYPE information, implementing federation causes "unknown"-typed metrics.
  • Using Prometheus Remote Write at any point in the ingestion pipeline. This protocol also intentionally drops TYPE information.
  • Using a relabeling rule that modifies the metric name. This causes the renamed metric to disassociate from the TYPE information associated with the original metric name.
  • The exporter not emitting a TYPE for each metric.
  • A transient issue where TYPE is dropped when the collector first starts up.

To resolve this issue, do the following:

  • Stop using federation with Managed Service for Prometheus. If you want to reduce cardinality and cost by "rolling up" data before sending it to Monarch, see Configure local aggregation.
  • Stop using Prometheus Remote Write in your collection path.
  • Confirm that the # TYPE field exists for each metric by visiting the /metrics endpoint.
  • Delete any relabeling rules that modify the name of a metric.
  • Delete any conflicting metrics with the "unknown" or "unknown:counter" suffix by calling DeleteMetricDescriptor.
  • Alternatively, always query counters by using rate() or another counter-processing function.

You can also create a metric-exclusion rule within Metrics Management to prevent any "unknown"-suffixed metrics from being ingested by using the regular expression prometheus.googleapis.com/.+/unknown.*. If you don't fix the underlying issue before installing this rule, you might prevent wanted metric data from being ingested.
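Before installing such a rule, it can help to check which metric types the expression would match. The following is a minimal sketch that uses Python's `re` module as a stand-in; whether Metrics Management applies the expression as a full match is an assumption, and the metric types listed are hypothetical examples.

```python
import re

# Pattern quoted in the text above; the dots are left unescaped, as written.
UNKNOWN_PATTERN = re.compile(r"prometheus.googleapis.com/.+/unknown.*")


def would_be_excluded(metric_type):
    """Return True if the whole metric type matches the exclusion pattern.

    Assumption: the exclusion rule behaves like a full match.
    """
    return UNKNOWN_PATTERN.fullmatch(metric_type) is not None


# Hypothetical metric types for illustration only.
candidates = [
    "prometheus.googleapis.com/metric_name_foo/unknown",
    "prometheus.googleapis.com/metric_name_foo/unknown:counter",
    "prometheus.googleapis.com/metric_name_foo/counter",
    "prometheus.googleapis.com/unknown_requests_total/gauge",
]

excluded = [m for m in candidates if would_be_excluded(m)]
```

In this sketch only the two types whose final path segment starts with "unknown" are excluded; a metric whose name merely begins with "unknown" is kept.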

Grafana data not persisted after pod restart

If your data appears to vanish from Grafana after a pod restart but is visible in Cloud Monitoring, then you are using Grafana to query the local Prometheus instance instead of Managed Service for Prometheus.

For information about configuring Grafana to use the managed service as a data source, see Grafana.

Importing Grafana dashboards

For information about using and troubleshooting the dashboard importer, see Import Grafana dashboards into Cloud Monitoring.

For information about problems with the conversion of the dashboard contents, see the importer's README file.

Ingestion-side problems

Ingestion-side problems can be related to either collection or rule evaluation. Start by looking at the error logs for managed collection. You can run the following commands:

kubectl logs -f -n gmp-system -lapp.kubernetes.io/part-of=gmp

kubectl logs -f -n gmp-system -lapp.kubernetes.io/name=collector -c prometheus

On GKE Autopilot clusters, you can run the following commands:

kubectl logs -f -n gke-gmp-system -lapp.kubernetes.io/part-of=gmp

kubectl logs -f -n gke-gmp-system -lapp.kubernetes.io/name=collector -c prometheus

The target status feature can help you debug your scrape target. For more information, see target status information.

Endpoint status is missing or too old

If you have enabled the target status feature but one or more of your PodMonitoring or ClusterPodMonitoring resources are missing the Status.Endpoint Statuses field or value, then you might have one of the following problems:

  • Managed Service for Prometheus was unable to reach a collector on the same node as one of your endpoints.
  • One or more of your PodMonitoring or ClusterPodMonitoring configs resulted in no valid targets.

Similar problems can also cause the Status.Endpoint Statuses.Last Update Time field to have a value older than a few minutes plus your scrape interval.

To resolve this issue, start by checking that the Kubernetes pods associated with your scrape endpoint are running. If your Kubernetes pods are running, the label selectors match, and you can manually access the scrape endpoints (typically by visiting the /metrics endpoint), then check whether the Managed Service for Prometheus collectors are running.

Collectors fraction is less than 1

If you have enabled the target status feature, then you get status information about your resources. The Status.Endpoint Statuses.Collectors Fraction value of your PodMonitoring or ClusterPodMonitoring resources represents the fraction of collectors, expressed from 0 to 1, that are reachable. For example, a value of 0.5 indicates that 50% of your collectors are reachable, while a value of 1 indicates that 100% of your collectors are reachable.

If the Collectors Fraction field has a value other than 1, then one or more collectors are unreachable, and metrics on any of those nodes are possibly not being scraped. Ensure that all collectors are running and reachable over the cluster network. You can view the status of collector pods with the following command:

kubectl -n gmp-system get pods --selector="app.kubernetes.io/name=collector"

On GKE Autopilot clusters, this command looks slightly different:

kubectl -n gke-gmp-system get pods --selector="app.kubernetes.io/name=collector"

You can investigate individual collector pods (for example, a collector pod named collector-12345) with the following command:

kubectl -n gmp-system describe pods/collector-12345

On GKE Autopilot clusters, run the following command:

kubectl -n gke-gmp-system describe pods/collector-12345

If collectors are not healthy, see GKE workload troubleshooting.

If the collectors are healthy, then check the operator logs. To check the operator logs, first run the following command to find the operator pod name:

kubectl -n gmp-system get pods --selector="app.kubernetes.io/name=gmp-operator"

On GKE Autopilot clusters, run the following command:

kubectl -n gke-gmp-system get pods --selector="app.kubernetes.io/name=gmp-operator"

Then, check the operator logs (for example, an operator pod named gmp-operator-12345) with the following command:

kubectl -n gmp-system logs pods/gmp-operator-12345

On GKE Autopilot clusters, run the following command:

kubectl -n gke-gmp-system logs pods/gmp-operator-12345

Unhealthy targets

If you have enabled the target status feature, but one or more of your PodMonitoring or ClusterPodMonitoring resources has the Status.Endpoint Statuses.Unhealthy Targets field with a value other than 0, then the collector cannot scrape one or more of your targets.

View the Sample Groups field, which groups targets by error message, and find the Last Error field. The Last Error field comes from Prometheus and tells you why the target could not be scraped. To resolve this issue, using the sample targets as a reference, check whether your scrape endpoints are running.

Unauthorized scrape endpoint

If you see one of the following errors and your scrape target requires authorization, then your collector is either not set up to use the correct authorization type or is using the incorrect authorization payload:

  • server returned HTTP status 401 Unauthorized
  • x509: certificate signed by unknown authority

To resolve this issue, see Configuring an authorized scrape endpoint.

Quota exceeded

If you see the following error, then you have exceeded your ingestion quota for the Cloud Monitoring API:

  • "429: Quota exceeded for quota metric 'Time series ingestion requests' and limit 'Time series ingestion requests per minute' of service 'monitoring.googleapis.com' for consumer 'project_number:PROJECT_NUMBER'., rateLimitExceeded"

This error is most commonly seen when first bringing up the managed service. The default quota will be exhausted at 100,000 samples per second ingested.

To resolve this issue, submit a request to increase your ingestion quota for the Monitoring API. For assistance, contact Google Cloud Support. For more information about quotas, see Working with quotas.

Missing permission on the node's default service account

If you see one of the following errors, then the default service account on the node might be missing permissions:

  • "execute query: Error querying Prometheus: client_error: client error: 403"
  • "Readiness probe failed: HTTP probe failed with statuscode: 503"
  • "Error querying Prometheus instance"

Managed collection and the managed rule evaluator in Managed Service for Prometheus both use the default service account on the node. This account is created with all the necessary permissions, but customers sometimes manually remove the Monitoring permissions. This removal causes collection and rule evaluation to fail.

To verify the permissions of the service account, do one of the following:

  • Identify the underlying Compute Engine node name, and then run the following command:

    gcloud compute instances describe NODE_NAME --format="json" | jq .serviceAccounts
    

    Look for the string https://www.googleapis.com/auth/monitoring. If necessary, add Monitoring as described in Misconfigured service account.

  • Navigate to the underlying VM in the cluster and check the configuration of the node's service account:

    1. In the Google Cloud console, go to the Kubernetes clusters page:

      Go to Kubernetes clusters

      If you use the search bar to find this page, then select the result whose subheading is Kubernetes Engine.

    2. Select Nodes, then click on the name of the node in the Nodes table.

    3. Click Details.

    4. Click the VM Instance link.

    5. Locate the API and identity management pane, and click Show details.

    6. Look for Stackdriver Monitoring API with full access.
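The scope check on the command output can also be scripted. The following is a minimal sketch that reads the JSON printed by `gcloud compute instances describe NODE_NAME --format="json"` and lists any Monitoring-capable scopes. Treating the `monitoring`, `monitoring.write`, and `cloud-platform` scopes as sufficient is an assumption of this sketch, and the sample instance data and service account email are hypothetical.

```python
import json

# Assumption: any of these scopes lets the node write metric data.
WRITE_SCOPES = {
    "https://www.googleapis.com/auth/monitoring",
    "https://www.googleapis.com/auth/monitoring.write",
    "https://www.googleapis.com/auth/cloud-platform",
}


def monitoring_scopes(describe_json):
    """Return (email, scope) pairs for scopes that allow writing metrics."""
    instance = json.loads(describe_json)
    found = []
    for account in instance.get("serviceAccounts", []):
        for scope in account.get("scopes", []):
            if scope in WRITE_SCOPES:
                found.append((account.get("email"), scope))
    return found


# Hypothetical output for a node whose Monitoring scope was removed.
sample_missing = json.dumps({
    "name": "example-node",
    "serviceAccounts": [{
        "email": "example-sa@example-project.iam.gserviceaccount.com",
        "scopes": [
            "https://www.googleapis.com/auth/devstorage.read_only",
            "https://www.googleapis.com/auth/logging.write",
        ],
    }],
})

# An empty result suggests that the Monitoring permissions are missing.
missing_result = monitoring_scopes(sample_missing)
```

An empty list for a node is a hint to restore the Monitoring access for that node's service account.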

It's also possible that the data source syncer or the Prometheus UI has been configured to look at the wrong project. For information about verifying that you are querying the intended metrics scope, see Change the queried project.

Misconfigured service account

If you see one of the following error messages, then the service account used by the collector does not have the correct permissions:

  • "code = PermissionDenied desc = Permission monitoring.timeSeries.create denied (or the resource may not exist)"
  • "google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information."

To verify that your service account has the correct permissions, do the following:

  1. In the Google Cloud console, go to the IAM page:

    Go to IAM

    If you use the search bar to find this page, then select the result whose subheading is IAM & Admin.

  2. Identify the service account name in the list of principals. Verify that the name of the service account is correctly spelled. Then click Edit.

  3. Select the Role field, then click Currently used and search for the Monitoring Metric Writer or the Monitoring Editor role. If the service account doesn't have one of these roles, then grant the service account the role Monitoring Metric Writer (roles/monitoring.metricWriter).

If you are running on non-GKE Kubernetes, then you must explicitly pass credentials to both the collector and the rule evaluator. You must repeat the credentials in both the rules and collection sections. For more information, see Provide credentials explicitly (for collection) or Provide credentials explicitly (for rules).

Service accounts are often scoped to a single Google Cloud project. Using one service account to write metric data for multiple projects — for example, when one managed rule evaluator is querying a multi-project metrics scope — can cause this permission error. If you are using the default service account, consider configuring a dedicated service account so that you can safely add the monitoring.timeSeries.create permission for several projects. If you can't grant this permission, then you can use metric relabeling to rewrite the project_id label to another name. The project ID then defaults to the Google Cloud project in which your Prometheus server or rule evaluator is running.

Invalid scrape configuration

If you see the following error, then your PodMonitoring or ClusterPodMonitoring is improperly formed:

  • "Internal error occurred: failed calling webhook "validate.podmonitorings.gmp-operator.gmp-system.monitoring.googleapis.com": Post "https://gmp-operator.gmp-system.svc:443/validate/monitoring.googleapis.com/v1/podmonitorings?timeout=10s": EOF""

To solve this, make sure your custom resource is properly formed according to the specification.

Admission webhook unable to parse or invalid HTTP client config

On versions of Managed Service for Prometheus earlier than 0.12, you might see an error similar to the following, which is related to secret injection in the non-default namespace:

  • "admission webhook "validate.podmonitorings.gmp-operator.gmp-system.monitoring.googleapis.com" denied the request: invalid definition for endpoint with index 0: unable to parse or invalid Prometheus HTTP client config: must use namespace "my-custom-namespace", got: "default""

To solve this issue, upgrade to version 0.12 or later.

Problems with scrape intervals and timeouts

When using Managed Service for Prometheus, the scrape timeout can't be greater than the scrape interval. To check your logs for this problem, run the following command:

kubectl -n gmp-system logs ds/collector prometheus

On GKE Autopilot clusters, run the following command:

kubectl -n gke-gmp-system logs ds/collector prometheus

Look for this message:

  • "scrape timeout greater than scrape interval for scrape config with job name "PodMonitoring/gmp-system/example-app/go-metrics""

To resolve this issue, set the value of the scrape interval equal to or greater than the value of the scrape timeout.
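The rule can be checked before you apply a configuration. The following is a minimal sketch that compares two Prometheus-style duration strings; the parser handles only the `ms`, `s`, `m`, and `h` units, and the function names are hypothetical rather than part of any Managed Service for Prometheus tooling.

```python
import re

# Seconds per unit for the duration units this sketch understands.
UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}
DURATION_PART = re.compile(r"(\d+)(ms|s|m|h)")


def parse_duration(text):
    """Convert a duration such as '30s' or '1m30s' to seconds."""
    total = 0.0
    position = 0
    for part in DURATION_PART.finditer(text):
        if part.start() != position:
            break  # Unparsed characters between two parts.
        total += int(part.group(1)) * UNIT_SECONDS[part.group(2)]
        position = part.end()
    if position == 0 or position != len(text):
        raise ValueError(f"unrecognized duration: {text!r}")
    return total


def timeout_fits_interval(interval, timeout):
    """Return True if the scrape timeout is not greater than the interval."""
    return parse_duration(timeout) <= parse_duration(interval)
```

For example, `timeout_fits_interval("30s", "10s")` is true, while `timeout_fits_interval("30s", "1m")` is false and corresponds to the configuration that produces the log message above.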

Missing TYPE on metric

If you see the following error, then the metric is missing type information:

  • "no metadata found for metric name "{metric_name}""

To verify that missing type information is the problem, check the /metrics output of the exporting application. If there is no line like the following, then the type information is missing:

# TYPE {metric_name} <type>

Certain libraries, such as those from VictoriaMetrics older than version 1.28.0, intentionally drop the type information. These libraries are not supported by Managed Service for Prometheus.
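Checking a long /metrics page by eye is error-prone, so the check can be scripted. The following is a minimal sketch that scans exposition text and reports sample names with no matching `# TYPE` line. It assumes the classic Prometheus text format, in which histogram and summary samples carry the `_bucket`, `_sum`, or `_count` suffix; the function name and sample data are hypothetical.

```python
import re

SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")
# Suffixes added to histogram and summary samples in the text format.
SUFFIXES = ("_bucket", "_sum", "_count")


def metrics_missing_type(exposition_text):
    """Return sample names that have no corresponding '# TYPE' line."""
    typed = set()
    sampled = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if line.startswith("# TYPE "):
            parts = line.split()
            if len(parts) >= 4:
                typed.add(parts[2])
        elif line and not line.startswith("#"):
            match = SAMPLE_NAME.match(line)
            if match:
                sampled.append(match.group(1))

    missing = []
    for name in sampled:
        # A sample is typed if its name, or its name minus a suffix, is typed.
        bases = [name] + [name[: -len(s)] for s in SUFFIXES if name.endswith(s)]
        if not any(base in typed for base in bases) and name not in missing:
            missing.append(name)
    return missing


# Hypothetical /metrics output; queue_depth has no TYPE line.
sample_output = """\
# HELP http_requests_total Total requests.
# TYPE http_requests_total counter
http_requests_total{code="200"} 1027
# TYPE request_seconds histogram
request_seconds_bucket{le="0.5"} 3
request_seconds_sum 1.2
request_seconds_count 3
queue_depth 7
"""

untyped = metrics_missing_type(sample_output)
```

Any name in the result is a candidate for the "no metadata found" error and is worth fixing in the exporter.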

Time-series collisions

If you see one of the following errors, you might have more than one collector attempting to write to the same time series:

  • "One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric."
  • "One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older end time than the most recent point."

The most common causes and solutions follow:

  • Using high-availability pairs. Managed Service for Prometheus does not support traditional high-availability collection. Using this configuration can create multiple collectors that try to write data to the same time series, causing this error.

    To resolve the problem, disable the duplicate collectors by reducing the replica count to 1, or use the supported high-availability method.

  • Using relabeling rules, particularly those that operate on jobs or instances. Managed Service for Prometheus partially identifies a unique time series by the combination of {project_id, location, cluster, namespace, job, instance} labels. Using a relabeling rule to drop these labels, especially the job and instance labels, can frequently cause collisions. Rewriting these labels is not recommended.

    To resolve the problem, delete the rule that is causing it; the culprit is often a metricRelabeling rule that uses the labeldrop action. You can identify the problematic rule by commenting out all the relabeling rules and then reinstating them, one at a time, until the error recurs.

A less common cause of time-series collisions is using a scrape interval shorter than 5 seconds. The minimum scrape interval supported by Managed Service for Prometheus is 5 seconds.
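To see why dropping identifying labels causes collisions, consider the following minimal sketch. It models each time series as a dictionary of labels and reports the label sets that become identical once certain labels are dropped. The label values and function names are hypothetical, and the sketch does not model how Managed Service for Prometheus builds the series identity.

```python
from collections import Counter


def series_key(labels):
    """Order-independent identity for a set of labels."""
    return tuple(sorted(labels.items()))


def find_collisions(series, dropped_labels):
    """Return label sets shared by more than one series after the drop."""
    counts = Counter(
        series_key({k: v for k, v in labels.items() if k not in dropped_labels})
        for labels in series
    )
    return [dict(key) for key, count in counts.items() if count > 1]


# Hypothetical series scraped from three pods.
series = [
    {"__name__": "requests_total", "job": "api", "instance": "10.0.0.1:8080"},
    {"__name__": "requests_total", "job": "api", "instance": "10.0.0.2:8080"},
    {"__name__": "requests_total", "job": "worker", "instance": "10.0.0.3:8080"},
]

# Dropping "instance" makes the two "api" series indistinguishable.
collisions = find_collisions(series, {"instance"})
```

With no labels dropped, the same function returns an empty list, which matches the advice to remove the rule that drops the labels.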

Exceeding the limit on the number of labels

If you see the following error, then you might have too many labels defined for one of your metrics:

  • "One or more TimeSeries could not be written: The new labels would cause the metric prometheus.googleapis.com/METRIC_NAME to have over PER_PROJECT_LIMIT labels".

This error usually occurs when you rapidly change the definition of the metric so that one metric name effectively has multiple independent sets of label keys over the whole lifetime of your metric. Cloud Monitoring imposes a limit on the number of labels for each metric; for more information, see the limits for user-defined metrics.

There are three steps to resolve this problem:

  1. Identify why a given metric has too many or frequently changing labels.

  2. Address the source of the problem, which might involve adjusting your PodMonitoring's relabeling rules, changing the exporter, or fixing your instrumentation.

  3. Delete the metric descriptor for this metric (which will incur data loss), so it can be recreated with a smaller, more stable set of labels. You can use the metricDescriptors.delete method to do so.

The most common sources of the problem are:

  • Collecting metrics from exporters or applications that attach dynamic labels on metrics. For example, self-deployed cAdvisor with additional container labels and environment variables or the DataDog agent, which injects dynamic annotations.

    To resolve this, you can use a metricRelabeling section on the PodMonitoring to either keep or drop labels. Some applications and exporters also allow configuration that changes exported metrics. For example, cAdvisor has a number of advanced runtime settings that can dynamically add labels. When using managed collection, we recommend using the built-in automatic kubelet collection.

  • Using relabeling rules, particularly those that attach label names dynamically, which can cause an unexpected number of labels.

    To resolve the problem, delete the rule entry that is causing it.
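For the first step, identifying which metric accumulates label keys, a small script over exported series can help. The following is a minimal sketch that takes the union of label keys seen for each metric name and flags the names that exceed a limit. The limit of 3 is a placeholder for illustration and not the Cloud Monitoring limit; the metric names and labels are hypothetical.

```python
from collections import defaultdict


def label_keys_per_metric(series):
    """Map each metric name to the union of label keys seen for it."""
    keys = defaultdict(set)
    for metric_name, labels in series:
        keys[metric_name].update(labels)  # Adds the dictionary's keys.
    return dict(keys)


def metrics_over_limit(series, limit):
    """Return metric names whose combined label keys exceed the limit."""
    return sorted(
        name
        for name, keys in label_keys_per_metric(series).items()
        if len(keys) > limit
    )


# Hypothetical series: container_info picks up a new label key per pod.
series = [
    ("requests_total", {"code": "200", "method": "GET"}),
    ("requests_total", {"code": "500", "method": "POST"}),
    ("container_info", {"image": "a", "label_app": "web"}),
    ("container_info", {"image": "b", "label_team": "core"}),
    ("container_info", {"image": "c", "env_REGION": "us", "label_tier": "db"}),
]

# Placeholder limit for illustration only.
offenders = metrics_over_limit(series, limit=3)
```

No single series in this sketch has more than three labels, yet the metric as a whole accumulates five label keys, which is the pattern that triggers the error.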

Rate limits on creating and updating metrics and labels

If you see the following error, then you have hit the per-minute rate limit on creating new metrics and adding new metric labels to existing metrics:

  • "Request throttled. You have hit the per-project limit on metric definition or label definition changes per minute."

This rate limit is usually only hit when first integrating with Managed Service for Prometheus, for example when you migrate an existing, mature Prometheus deployment to use self-deployed collection. This is not a rate limit on ingesting data points. This rate limit only applies when creating never-before-seen metrics or when adding new labels to existing metrics.

This quota is fixed, but any issues should automatically resolve as new metrics and metric labels get created up to the per-minute limit.

Limits on the number of metric descriptors

If you see the following error, then you have hit the quota limit for the number of metric descriptors within a single Google Cloud project:

  • "Your metric descriptor quota has been exhausted."

By default, this limit is set to 25,000. Although this quota can be lifted by request if your metrics are well-formed, it is far more likely that you hit this limit because you are ingesting malformed metric names into the system.

Prometheus has a dimensional data model where information such as cluster or namespace name should get encoded as a label value. When dimensional information is instead embedded in the metric name itself, then the number of metric descriptors increases indefinitely. In addition, because in this scenario labels are not properly used, it becomes much more difficult to query and aggregate data across clusters, namespaces, or services.

Neither Cloud Monitoring nor Managed Service for Prometheus supports non-dimensional metrics, such as those formatted for StatsD or Graphite. While most Prometheus exporters are configured correctly out-of-the-box, certain exporters, such as the StatsD exporter, the Vault exporter, or the Envoy Proxy that comes with Istio, must be explicitly configured to use labels instead of embedding information in the metric name. Examples of malformed metric names include:

  • request_path_____path_to_a_resource____istio_request_duration_milliseconds
  • envoy_cluster_grpc_method_name_failure
  • envoy_cluster_clustername_upstream_cx_connect_ms_bucket
  • vault_rollback_attempt_path_name_1700683024
  • service__________________________________________latency_bucket

To confirm this issue, do the following:

  1. In the Google Cloud console, select the Google Cloud project that is linked to the error.
  2. In the Google Cloud console, go to the Metrics management page:

    Go to Metrics management

    If you use the search bar to find this page, then select the result whose subheading is Monitoring.

  3. Confirm that the sum of Active plus Inactive metrics is over 25,000. In most situations, you should see a large number of Inactive metrics.
  4. Select "Inactive" in the Quick Filters panel, page through the list, and look for patterns.
  5. Select "Active" in the Quick Filters panel, sort by Samples billable volume descending, page through the list, and look for patterns.
  6. Sort by Samples billable volume ascending, page through the list, and look for patterns.

Alternatively, you can confirm this issue by using Metrics Explorer:

  1. In the Google Cloud console, select the Google Cloud project that is linked to the error.
  2. In the Google Cloud console, go to the Metrics explorer page:

    Go to Metrics explorer

    If you use the search bar to find this page, then select the result whose subheading is Monitoring.

  3. In the query builder, click Select a metric, then clear the "Active" checkbox.
  4. Type "prometheus" into the search bar.
  5. Look for any patterns in the names of metrics.

Once you have identified the patterns that indicate malformed metrics, you can mitigate the issue by fixing the exporter at the source and then deleting the offending metric descriptors.

To prevent this issue from happening again, you must first configure the relevant exporter to no longer emit malformed metrics. We recommend consulting the documentation for your exporter for help. You can confirm you have fixed the problem by manually visiting the /metrics endpoint and inspecting the exported metric names.

You can then free up your quota by deleting the malformed metrics using the projects.metricDescriptors.delete method. To more easily iterate through the list of malformed metrics, we provide a Golang script you can use. This script accepts a regular expression that can identify your malformed metrics and deletes any metric descriptors that match the pattern. As metric deletion is irreversible, we strongly recommend first running the script using dry run mode.
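
Before passing a regular expression to any deletion tool, it can help to preview locally which names the expression would match. The following is a minimal sketch under stated assumptions: the helper preview_matches and the pattern are hypothetical, the names are taken from the examples earlier in this section, and the deletion script might apply its expression differently (for example, anchored or against the full metric type), so check the script's own documentation.

```python
import re


def preview_matches(pattern, metric_names):
    """Split names into (would_match, would_not_match) for a candidate regex."""
    regex = re.compile(pattern)
    would_match = [name for name in metric_names if regex.search(name)]
    would_not_match = [name for name in metric_names if not regex.search(name)]
    return would_match, would_not_match


names = [
    "request_path_____path_to_a_resource____istio_request_duration_milliseconds",
    "vault_rollback_attempt_path_name_1700683024",
    "service__________________________________________latency_bucket",
    "envoy_cluster_grpc_method_name_failure",
    "up",
]

# Hypothetical pattern: four or more consecutive underscores,
# or a name that ends in an underscore followed by a 10-digit number.
pattern = r"_{4,}|_\d{10}$"

matched, unmatched = preview_matches(pattern, names)
print("would match:", matched)
print("would not match:", unmatched)
```

This pattern does not catch envoy_cluster_grpc_method_name_failure, which shows why each exporter usually needs its own expression. Full metric types in Cloud Monitoring carry a prefix and a type suffix, such as prometheus.googleapis.com/metric_name/gauge, so patterns anchored with ^ or $ need adjusting before they are used against those names.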

No errors and no metrics

If you are using managed collection, you don't see any errors, but data is not appearing in Cloud Monitoring, then the most likely cause is that your metric exporters or scrape configurations are not configured correctly. Managed Service for Prometheus does not send any time series data unless you first apply a valid scrape configuration.

To identify whether this is the cause, try deploying the example application and example PodMonitoring resource. If you now see the up metric (it may take a few minutes), then the problem is with your scrape configuration or exporter.

The root cause could be any number of things. We recommend checking the following:

  • Your PodMonitoring references a valid port.

  • Your exporter's Deployment spec has properly named ports.

  • Your selectors (most commonly app) match on your Deployment and PodMonitoring resources.

  • You can see data at your expected endpoint and port by manually visiting it.

  • You have installed your PodMonitoring resource in the same namespace as the application you wish to scrape. Do not install any custom resources or applications in the gmp-system or gke-gmp-system namespace.

  • Your metric and label names match Prometheus' validating regular expression. Managed Service for Prometheus does not support label names that start with the _ character.

  • You are not using a set of filters that causes all data to be filtered out. Take extra care that you don't have conflicting filters when using a collection filter in the OperatorConfig resource.

  • If running outside of Google Cloud, project or project-id is set to a valid Google Cloud project and location is set to a valid Google Cloud region. You can't use global as a value for location.

  • Your metric is one of the four Prometheus metric types. Some libraries like Kube State Metrics expose OpenMetrics metric types like Info, Stateset and GaugeHistogram, but these metric types are not supported by Managed Service for Prometheus and are silently dropped.
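
The name check in the list above can be sketched in a few lines. This example assumes the classic name patterns from the Prometheus data model documentation (newer Prometheus versions can be configured to accept a wider character set), and adds the restriction stated above that label names must not start with the _ character. The function names are hypothetical.

```python
import re

# Classic Prometheus patterns for metric names and label names.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name):
    """True if the whole name matches the classic metric-name pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_supported_label_name(name):
    """True if the name matches the classic label pattern and has no leading underscore."""
    return LABEL_NAME_RE.fullmatch(name) is not None and not name.startswith("_")


for candidate in ["http_requests_total", "http-requests", "1xx_total"]:
    print("metric", candidate, is_valid_metric_name(candidate))

for candidate in ["namespace", "_hidden", "pod-name"]:
    print("label", candidate, is_supported_label_name(candidate))
```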

Some metrics are missing for short-running targets

Google Cloud Managed Service for Prometheus is deployed and there are no configuration errors; however, some metrics are missing.

Determine the deployment that generates the partially missing metrics. If the deployment is a Google Kubernetes Engine CronJob, then determine how long the job typically runs:

  1. Find the CronJob deployment YAML file and find the status, which is listed at the end of the file. The status in this example shows that the job ran for about one minute:

      status:
        lastScheduleTime: "2024-04-03T16:20:00Z"
        lastSuccessfulTime: "2024-04-03T16:21:07Z"
    
  2. If the run time is less than five minutes, then the job isn't running long enough for the metric data to be consistently scraped.

    To resolve this situation, try the following:

    • Configure the job to ensure that it doesn't exit until at least five minutes have elapsed since the job started.

    • Configure the job to detect whether metrics have been scraped before exiting. This capability requires library support.

    • Consider creating a log-based distribution-valued metric instead of collecting metric data. This approach is suggested when data is published at a low rate. For more information, see Log-based metrics.

  3. If the run time is longer than five minutes or if it is inconsistent, then see the Unhealthy targets section of this document.
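
The run-time comparison in the steps above can be sketched as follows. The example approximates the run time as the difference between lastSuccessfulTime and lastScheduleTime, which also includes any scheduling delay, so treat the result as an estimate. The helper names are hypothetical, and the timestamps are the ones from the status example.

```python
from datetime import datetime, timezone

MIN_RUN_SECONDS = 5 * 60  # the five-minute threshold from the steps above


def parse_k8s_time(value):
    """Parse a Kubernetes status timestamp such as 2024-04-03T16:20:00Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def approximate_run_seconds(last_schedule_time, last_successful_time):
    """Seconds between the scheduled start and the successful completion."""
    delta = parse_k8s_time(last_successful_time) - parse_k8s_time(last_schedule_time)
    return delta.total_seconds()


run_seconds = approximate_run_seconds("2024-04-03T16:20:00Z", "2024-04-03T16:21:07Z")
print(run_seconds)                    # 67.0
print(run_seconds < MIN_RUN_SECONDS)  # True: too short to be scraped consistently
```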

Problems with collection from exporters

If your metrics from an exporter are not being ingested, check the following:

  • Verify that the exporter is working and exporting metrics by using the kubectl port-forward command.

    For example, to check that pods with the selector app.kubernetes.io/name=redis in the namespace test are emitting metrics at the /metrics endpoint on port 9121, you can port-forward as follows:

    kubectl port-forward "$(kubectl get pods -l app.kubernetes.io/name=redis -n test -o jsonpath='{.items[0].metadata.name}')" 9121
    

    Access the endpoint localhost:9121/metrics by using the browser or curl in another terminal session to verify that the metrics are being exposed by the exporter for scraping.

  • Check if you can query the metrics in the Google Cloud console but not Grafana. If so, then the problem is with Grafana, not the collection of your metrics.

  • Verify that the managed collector is able to scrape the exporter by inspecting the Prometheus web interface the collector exposes.

    1. Identify the managed collector running on the same node on which your exporter is running. For example, if your exporter is running on pods in the namespace test and the pods are labeled with app.kubernetes.io/name=redis, the following command identifies the managed collector running on the same node:

      kubectl get pods -l app=managed-prometheus-collector --field-selector="spec.nodeName=$(kubectl get pods -l app.kubernetes.io/name=redis -n test -o jsonpath='{.items[0].spec.nodeName}')" -n gmp-system -o jsonpath='{.items[0].metadata.name}'
      
    2. Set up port-forwarding from port 19090 of the managed collector:

      kubectl port-forward POD_NAME -n gmp-system 19090
      
    3. Navigate to the URL localhost:19090/targets to access the web interface. If the exporter is listed as one of the targets, then your managed collector is successfully scraping the exporter.
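
As an alternative to a browser or curl for the first check above, the following sketch fetches the port-forwarded endpoint and lists the metric names that have at least one sample. It assumes that the port-forward from the earlier example is still running on localhost:9121. The function names are hypothetical, and the parsing handles only the common `name{labels} value` line shape.

```python
import urllib.request


def fetch_metrics(url="http://localhost:9121/metrics", timeout=5):
    """Fetch the raw exposition text from a port-forwarded exporter."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def sample_metric_names(exposition_text):
    """Return the sorted metric names that have at least one sample line."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or whitespace (value).
        parts = line.split("{", 1)[0].split()
        if parts:
            names.add(parts[0])
    return sorted(names)


# Example usage while the port-forward is active:
#   print(sample_metric_names(fetch_metrics()))
```

An empty list means that the exporter is reachable but is not exposing any samples, which points to the exporter rather than to the collector.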

Collector Out Of Memory (OOM) errors

If you are using managed collection and encountering Out Of Memory (OOM) errors on your collectors, then consider enabling vertical pod autoscaling.

Operator has Out Of Memory (OOM) errors

If you are using managed collection and encountering Out Of Memory (OOM) errors on your operator, then consider disabling the target status feature. The target status feature can cause operator performance issues in larger clusters.

Firewalls

A firewall can cause both ingestion and query problems. Your firewall must be configured to permit both POST and GET requests to the Monitoring API service, monitoring.googleapis.com, to allow ingestion and queries.

Error about concurrent edits

The error message "Too many concurrent edits to the project configuration" is usually transient, resolving after a few minutes. It is usually caused by removing a relabeling rule that affects many different metrics. The removal causes the formation of a queue of updates to the metric descriptors in your project. The error goes away when the queue is processed.

For more information, see Rate limits on creating and updating metrics and labels.

Queries blocked and cancelled by Monarch

If you see the following error, then you have hit the internal limit for the number of concurrent queries that can be run for any given project:

  • "internal: expanding series: generic::aborted: invalid status monarch::220: Cancelled due to the number of queries whose evaluation is blocked waiting for memory is 501, which is equal to or greater than the limit of 500."

To protect against abuse, the system enforces a hard limit on the number of queries from one project that can run concurrently within Monarch. With typical Prometheus usage, queries should be quick and this limit should never be reached.

You might hit this limit if you are issuing a lot of concurrent queries that run for a longer-than-expected time. Queries requesting more than 25 hours of data are usually slower to execute than queries requesting less than 25 hours of data, and the longer the query lookback, the slower the query is expected to be.

Typically this issue is triggered by running lots of long-lookback rules in an inefficient way. For example, you might have many rules that run once every minute and request a 4-week rate. If each of these rules takes a long time to run, it might eventually cause a backup of queries waiting to run for your project, which then causes Monarch to throttle queries.

To resolve this issue, you need to increase the evaluation interval of your long-lookback rules so that they're not running every minute. Running a query for a 4-week rate every minute is unnecessary; there are 40,320 minutes in 4 weeks, so each minute gives you almost no additional signal (your data will change at most by 1/40,320th). Using a 1-hour evaluation interval should be sufficient for a query that requests a 4-week rate.
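
The arithmetic behind this recommendation can be checked with a short calculation. This is an illustrative sketch with hypothetical function names: it shows how little of a 4-week window changes between evaluations, and how many evaluations per day each interval produces.

```python
MINUTES_PER_DAY = 24 * 60


def window_fraction_per_evaluation(window_minutes, interval_minutes):
    """Fraction of the lookback window that is new data at each evaluation."""
    return interval_minutes / window_minutes


def evaluations_per_day(interval_minutes):
    """How many times a rule with this interval runs in one day."""
    return MINUTES_PER_DAY // interval_minutes


four_weeks = 4 * 7 * MINUTES_PER_DAY
print(four_weeks)  # 40320

for interval in (1, 60):
    print(
        interval,
        window_fraction_per_evaluation(four_weeks, interval),
        evaluations_per_day(interval),
    )
```

With a 1-hour interval, each evaluation still covers only 1/672 of new data, while the rule runs 24 times a day instead of 1,440 times.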

Once you resolve the bottleneck caused by inefficient long-running queries executing too frequently, this issue should resolve itself.

Incompatible value types

If you see the following error upon ingestion or query, then you have a value type incompatibility in your metrics:

  • "Value type for metric prometheus.googleapis.com/metric_name/gauge must be INT64, but is DOUBLE"
  • "Value type for metric prometheus.googleapis.com/metric_name/gauge must be DOUBLE, but is INT64"
  • "One or more TimeSeries could not be written: Value type for metric prometheus.googleapis.com/target_info/gauge conflicts with the existing value type (INT64)"

You might see this error upon ingestion, as Monarch does not support writing DOUBLE-typed data to INT64-typed metrics nor does it support writing INT64-typed data to DOUBLE-typed metrics. You also might see this error when querying using a multi-project metrics scope, as Monarch cannot union DOUBLE-typed metrics in one project with INT64-typed metrics in another project.

This error only happens when you have OpenTelemetry collectors reporting data. It is more likely to happen if you have both OpenTelemetry (using the googlemanagedprometheus exporter) and Prometheus reporting data for the same metric, as commonly happens for the target_info metric.

The cause is likely one of the following:

  • You are collecting OTLP metrics, and the OTLP metric library changed its value type from DOUBLE to INT64, as happened with OpenTelemetry's Java metrics. The new version of the metric library is now incompatible with the metric value type created by the old version of the metric library.
  • You are collecting the target_info metric using both Prometheus and OpenTelemetry. Prometheus collects this metric as a DOUBLE, while OpenTelemetry collects this metric as an INT64. Your collectors are now writing two value types to the same metric in the same project, and only the collector that first created the metric descriptor is succeeding.
  • You are collecting target_info using OpenTelemetry as an INT64 in one project, and you are collecting target_info using Prometheus as a DOUBLE in another project. Adding both metrics to the same metrics scope, then querying that metric through the metrics scope, causes an invalid union between incompatible metric value types.

This problem can be solved by forcing all metric value types to DOUBLE by doing the following:

  1. Reconfigure your OpenTelemetry collectors to force all metrics to be a DOUBLE by enabling the feature-gate exporter.googlemanagedprometheus.intToDouble flag.
  2. Delete all INT64 metric descriptors and let them get recreated as a DOUBLE. You can use the delete_metric_descriptors.go script to automate this.

Following these steps deletes all data that is stored as an INT64 metric. There is no alternative to deleting the INT64 metrics that fully solves this problem.
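
To find which metrics are affected across the projects in a metrics scope, you can compare the value types recorded for each metric descriptor. The following is a minimal sketch that operates on hypothetical data: the project IDs, the input dictionary, and the function name are assumptions for illustration. In practice the input might be built from the results of listing metric descriptors in each project.

```python
def find_value_type_conflicts(descriptors_by_project):
    """Return {metric_type: {value_type: [projects]}} for metrics that
    appear with more than one value type across the given projects."""
    seen = {}
    for project, descriptors in descriptors_by_project.items():
        for metric_type, value_type in descriptors.items():
            seen.setdefault(metric_type, {}).setdefault(value_type, []).append(project)
    return {
        metric_type: types
        for metric_type, types in seen.items()
        if len(types) > 1
    }


# Hypothetical descriptors: one project wrote target_info through
# OpenTelemetry (INT64), the other through Prometheus (DOUBLE).
descriptors_by_project = {
    "project-a": {
        "prometheus.googleapis.com/target_info/gauge": "INT64",
        "prometheus.googleapis.com/up/gauge": "DOUBLE",
    },
    "project-b": {
        "prometheus.googleapis.com/target_info/gauge": "DOUBLE",
        "prometheus.googleapis.com/up/gauge": "DOUBLE",
    },
}

for metric_type, types in find_value_type_conflicts(descriptors_by_project).items():
    print(metric_type, types)
```

Each metric reported here is a candidate for the two steps above; the INT64 descriptors are the ones to delete.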