Cost controls and attribution

Google Cloud Managed Service for Prometheus charges for the number of samples ingested into Cloud Monitoring and for read requests to the Monitoring API. The number of samples ingested is the primary contributor to your cost.

This document describes how you can control costs associated with metric ingestion and how to identify sources of high-volume ingestion.

For more information about the pricing for Managed Service for Prometheus, see Managed Service for Prometheus pricing summary.

View your bill

To view your Google Cloud bill, do the following:

  1. In the Google Cloud Console, go to the Billing page.

  2. If you have more than one billing account, select Go to linked billing account to view the current project's billing account. To locate a different billing account, select Manage billing accounts and choose the account for which you'd like to get usage reports.

  3. Select Reports.

  4. From the Services menu, select the Stackdriver Monitoring option.

  5. From the SKUs menu, select the following options:

    • Managed Service for Prometheus Samples Ingested
    • Monitoring API Requests

The following screenshot shows the billing report for Managed Service for Prometheus from one project:

The billing report for Managed Service for Prometheus shows current and projected usage.

Reduce your costs

To reduce the costs associated with using Managed Service for Prometheus, you can do the following:

  • Reduce the number of time series you send to the managed service by filtering the metric data you generate.
  • Reduce the number of samples that you collect by changing the scraping interval.
  • Limit the number of samples from potentially misconfigured high-cardinality metrics.

Reduce the number of time series

Open source Prometheus documentation rarely recommends filtering metric volume, which is reasonable when costs are bounded by machine costs. But when paying a managed-service provider on a unit basis, sending unlimited data can cause unnecessarily high bills.

The exporters included in the kube-prometheus project—the kube-state-metrics service in particular—can emit a lot of metric data. For example, the kube-state-metrics service emits hundreds of metrics, many of which might be completely valueless to you as a consumer. A fresh three-node cluster using the kube-prometheus project sends approximately 900 samples per second to Managed Service for Prometheus. Filtering these extraneous metrics might be enough by itself to get your bill down to an acceptable level.

To reduce the number of metrics, filter out the individual metrics that you don't need and stop scraping exporters that provide no value.

For example, if you are using the kube-state-metrics service, you might start with a keep filter in your scrape config and then adjust it over time. On a fresh three-node cluster, the following filter reduces your sample volume by approximately 125 samples per second:

kube_(daemonset|deployment|pod|namespace|node|statefulset).+
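
For self-deployed collection, you can apply such a filter as a metric_relabel_configs rule with the keep action, which drops every scraped series whose metric name does not match the regular expression. The following is a minimal sketch; the job name is a placeholder, and the comment stands in for your existing service discovery and scrape settings. If you use managed collection, you can express the same rule with the relabeling settings on the endpoint in your PodMonitoring resource.

  scrape_configs:
  - job_name: kube-state-metrics      # placeholder; use your existing job definition
    # ... your existing service discovery and scrape settings ...
    metric_relabel_configs:
    - source_labels: [__name__]
      regex: kube_(daemonset|deployment|pod|namespace|node|statefulset).+
      action: keep                    # keep only series whose names match the regex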

Sometimes, you might find an entire exporter to be unimportant. For example, the kube-prometheus package installs the following service monitors by default, many of which are unnecessary in a managed environment:

  • alertmanager
  • coredns
  • grafana
  • kube-apiserver
  • kube-controller-manager
  • kube-scheduler
  • kube-state-metrics
  • kubelet
  • node-exporter
  • prometheus
  • prometheus-adapter
  • prometheus-operator

To reduce the number of metrics that you export, you can delete, disable, or stop scraping the service monitors you don't need. For example, disabling the kube-apiserver service monitor on a fresh three-node cluster reduces your sample volume by approximately 200 samples per second.

Reduce the number of samples collected

Managed Service for Prometheus charges on a per-sample basis. You can reduce the number of samples ingested by increasing the length of the sampling period. For example:

  • Changing a 10-second sampling period to a 30-second sampling period can reduce your sample volume by 66%, without much loss of information.
  • Changing a 10-second sampling period to a 60-second sampling period can reduce your sample volume by 83%.

For information about how samples are counted and how the sampling period affects the number of samples, see Pricing examples based on samples ingested.

You can usually set the scraping interval on a per-job or a per-target basis.

For managed collection, you set the scrape interval in the PodMonitoring resource by using the interval field. For self-deployed collection, you set the sampling interval in your scrape configs, usually by setting an interval or scrape_interval field.
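
For example, the following minimal sketches set a 60-second interval in each mode; the resource name, label selector, port, job name, and target are placeholders rather than values from this document. At a 60-second period, each target produces one sample per series per minute instead of the six it produces at a 10-second period.

  # Managed collection: PodMonitoring resource (assuming the monitoring.googleapis.com/v1 API)
  apiVersion: monitoring.googleapis.com/v1
  kind: PodMonitoring
  metadata:
    name: example-app                 # placeholder name
  spec:
    selector:
      matchLabels:
        app: example-app              # placeholder label selector
    endpoints:
    - port: metrics                   # placeholder port name
      interval: 60s                   # scrape each target once per minute

  # Self-deployed collection: Prometheus scrape config
  scrape_configs:
  - job_name: example-app             # placeholder job name
    scrape_interval: 60s              # overrides the global scrape interval for this job
    static_configs:
    - targets: ['example-app:8080']   # placeholder target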

Limit samples from high-cardinality metrics

You can create high-cardinality metrics by adding labels that have a large number of potential values, like a user ID or IP address. Such metrics can generate a very large number of samples. Using labels with a large number of values is typically a misconfiguration. You can guard against high-cardinality metrics in your self-deployed collectors by setting a sample_limit value in your scrape configs.

If you use this limit, we recommend that you set it to a very high value, so that it only catches obviously misconfigured metrics. Any samples over the limit are dropped, and it can be very hard to diagnose issues caused by exceeding the limit.

Using a sample limit is not a good way to manage sample ingestion, but the limit can protect you against accidental misconfiguration. For more information, see Using sample_limit to avoid overload.
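
For self-deployed collection, such a guard might look like the following sketch; the job name, target, and limit are placeholder values, with the limit deliberately set far above the job's normal per-scrape sample count:

  scrape_configs:
  - job_name: example-app             # placeholder job name
    sample_limit: 200000              # per-scrape cap, set well above normal volume
    static_configs:
    - targets: ['example-app:8080']   # placeholder target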

Identify and attribute costs

You can use Cloud Monitoring to identify the Prometheus metrics that are writing the largest numbers of samples. These metrics are contributing the most to your costs. After you identify the most expensive metrics, you can modify your scrape configs to filter these metrics appropriately.

The following sections describe ways to analyze the number of samples that you are sending to Managed Service for Prometheus and attribute high volume to specific metrics, Kubernetes namespaces, and Google Cloud regions.

Identify high-volume metrics

To identify the Prometheus metrics with the largest ingestion volumes, do the following:

  1. In the Google Cloud Console, go to the Monitoring page.

  2. In the Monitoring navigation pane, click Metrics Explorer.
  3. Select the Configuration tab, and then use the following information to complete the fields:
    1. For the Resource type field, enter or select Metric Ingestion Attribution.
    2. For the Metric field, enter or select Samples written by attribution id.
    3. For the Group by field, select metric_type.
    4. For the Aggregator field, select sum.

    The chart now shows the ingestion volumes for each metric type.

  4. To identify the metrics with the largest ingestion volumes, click Value in the chart legend.

The resulting chart, which shows your top 300 metrics by volume, ranked by mean, looks like the following screenshot:

The configured chart shows the volume of metric ingestion for each metric.

Identify high-volume namespaces

You can also use the metric and resource types from the prior example to attribute ingestion volume to specific Kubernetes namespaces and then take appropriate action. For example:

  • To correlate overall ingestion volume with namespaces, select the following labels for the Group by field:

    • attribution_dimension
    • attribution_id
  • To correlate ingestion volume of individual metrics with namespaces, select the following labels for the Group by field:

    • attribution_dimension
    • attribution_id
    • metric_type
  • To identify the namespaces responsible for a specific high-volume metric:

    1. Identify the metric type for the high-volume metric by using one of the preceding examples to identify high-volume metric types. The metric type is the string in the chart legend that begins with prometheus.googleapis.com/.
    2. To restrict the chart data to a specific metric type, add a filter for the metric type in the Filters field. For example:

      metric_type=prometheus.googleapis.com/container_tasks_state/gauge

    3. Select the following labels for the Group by field:

      • attribution_dimension
      • attribution_id
  • To see ingestion by Google Cloud region, add the location label to the Group by field.

  • To see ingestion by Cloud project, add the resource_container label to the Group by field.