Cost optimization for Google Cloud Observability

Google Cloud Observability consists of a collection of cloud-based managed services designed to provide deep observability into app and infrastructure services. One of the benefits of managed services on Google Cloud is that services are usage-based, which means you pay only for what you use. While this pricing model might provide a cost benefit when compared to standard software licensing, it might make it challenging to forecast cost. This solution describes the ways that you can understand your usage of these services and optimize your costs.

Pricing

Because Google Cloud Observability services are managed services, they let you focus on the insights they provide, rather than the infrastructure required to use these services. When you use these services, you don't have to individually pay for virtual machines, software licenses, security scanning, hardware maintenance, or space in a data center. The services provide a simple per-usage cost.

Costs include charges for Cloud Logging, Cloud Monitoring, and Cloud Trace. Error Reporting doesn't have a separate cost while it's in beta, and you can use Cloud Profiler at no cost. For Error Reporting, you might incur minor Logging costs if your errors are ingested by Logging.

You can also read a summary of all pricing information in the Google Cloud Observability pricing documentation.

Logging and error-reporting costs

Logging prices are based on the volume of chargeable logs ingested, with a simple per-GiB cost. There is a free allotment per month, and certain logs, such as Cloud Audit Logs, are non-chargeable.
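For illustration, assume a hypothetical rate of $0.50 per GiB beyond a 50-GiB monthly free allotment (check the current Logging pricing page for actual values). The monthly cost then follows this formula:

monthly_cost = max(0, chargeable_gib_ingested - free_allotment_gib) * price_per_gib

For example, ingesting 200 GiB of chargeable logs in a month would cost (200 - 50) * $0.50 = $75.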

Example product usage that generates cost through additional log volume includes using:

  • Cloud Load Balancing
  • The Logging agent on Compute Engine
  • The Logging agent on Amazon Web Services (AWS)
  • The write operation in the Cloud Logging API

Monitoring costs

Monitoring prices are based on the volume of chargeable metrics ingested and the number of chargeable API calls. For example, non-Google Cloud metrics such as agent, custom, external, and AWS metrics are chargeable. The projects.timeSeries.list method in the Cloud Monitoring API is charged per API call, while the remainder of the API usage is free. There is a free metric-volume allotment per month, and many of the metrics, including all of the Google Cloud metrics, are non-chargeable. See Monitoring pricing for more information about which metrics are chargeable.

Example product usage that generates cost through metric volume and API calls includes using:

  • Monitoring custom metrics
  • The Monitoring agent on Compute Engine
  • The Monitoring agent on AWS
  • The read operation in the Monitoring API
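As an example of the last item, the following Python sketch calls the chargeable projects.timeSeries.list method through the google-cloud-monitoring client library. The project ID and metric type are placeholders; each execution of the query counts as one or more chargeable API calls.

from google.cloud import monitoring_v3
import time

client = monitoring_v3.MetricServiceClient()

# Query the last hour of CPU utilization. Each call to
# list_time_series is a chargeable projects.timeSeries.list call.
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)
results = client.list_time_series(
    request={
        "name": "projects/my-project",  # placeholder project ID
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    print(series.resource.labels["instance_id"], len(series.points))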

Trace costs

Trace prices are based on the number of spans ingested and scanned. Some Google Cloud services, such as the App Engine standard environment, automatically produce non-chargeable spans. There is a free allotment per month for Trace.

Example product usage that generates cost through spans ingested includes adding instrumentation for:

  • App Engine apps, outside of the default spans
  • Cloud Load Balancing
  • Custom apps
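For illustration, assume a hypothetical rate of $0.20 per million spans beyond a 2.5-million-span monthly free allotment (check the current Trace pricing page for actual values). Ingesting 50 million spans in a month would then cost (50 - 2.5) * $0.20 = $9.50.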

Usage

Understanding your usage can provide insight into which components generate cost. This helps you identify areas that might be appropriate for optimization.

Google Cloud bill analysis

The first step to understanding your usage is to review your Google Cloud bill and understand your costs. One way to gain insight is to use the Billing Reports available in the Google Cloud console.

The Reports page offers a useful range of filters to narrow the results by time, project, products, SKUs, and labels. You can use the Products filter to narrow the billing data to monitoring and logging costs.

Logging

Logging provides detailed lists of logs, current log volume, and projected monthly volume for each project. You can review these details for each project while reviewing your Logging charges on your Google Cloud bill. This makes it simple to view which logs have the highest volume and, therefore, contribute the most to the cost.

You can find the volume of logs ingested in your project on the Logs Storage page. The Logs Storage page provides a list of the logs and their volumes for the previous month and current month, as well as their projected volumes for the end of the month.

This analysis lets you develop insight into the usage for the logs in specific projects and how their volume has changed over time. You can use this analysis to identify which logs you should consider optimizing.

Monitoring

Monitoring, organized into metrics scopes, provides a detailed list of projects and previous, current, and projected metric volumes. Because a metrics scope might include more than one project, the volumes for each project are listed separately, as illustrated in the following image.

Metrics scope with multiple projects

Learn how to find the Monitoring usage details.

You can view a detailed graph of the metric-ingestion volume for each project in the Metrics Explorer in Monitoring, which provides insight into the volume of metrics ingested over time.

This analysis provides you with the Monitoring metric volumes for each project that you identified while reviewing your Monitoring charges on your Google Cloud bill. You can then review the specific metric volumes and understand which components are contributing the most volume and cost.

Trace

Trace provides a detailed view of the spans ingested for the current and previous month. You can review these details in the Google Cloud console for each project that you identify while reviewing your Trace charges on your Google Cloud bill.

This analysis provides you with the number of spans ingested for each of those projects. You can then review the specific number of spans ingested and understand which projects and services contribute the highest number of spans and cost.

Logging export

Logging provides sinks to export logs to Cloud Storage, BigQuery, and Pub/Sub.

For example, if you export all of your logs from Logging to BigQuery for long-term storage and analysis, you incur the BigQuery costs, including per-GiB storage, streaming inserts, and any query costs.

To understand the costs your exports are generating, consider the following steps:

  1. Find your logging sinks. Find out which logging sinks, if any, you have enabled. For example, your project might already have several logging sinks that were created for different purposes, such as security operations or to meet regulatory requirements.
  2. Review your usage details. Review the usage for the destination of the exports. For example, review the BigQuery table sizes for a BigQuery export, or the bucket sizes for Cloud Storage exports.

Find your logging sinks

Your logging sinks might be at the project level (one or more sinks per project) or they might be at the Google Cloud organizational level, called aggregated exports. The sinks might include many projects' logs in the same Google Cloud organization.

You can view your logging sinks by looking at specific projects. The Google Cloud console provides a list of the sinks and their destinations. To view the aggregated exports for your organization, you can use the gcloud logging sinks list --organization ORGANIZATION_ID command.
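For example, the following Python sketch uses the google-cloud-logging client library to list the project-level sinks and their destinations; the project ID is a placeholder.

from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="my-project")  # placeholder project ID

# Print each project-level sink with its destination and filter.
for sink in client.list_sinks():
    print(f"{sink.name} -> {sink.destination}")
    print(f"  filter: {sink.filter_}")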

Review your usage details

Monitoring provides a rich set of metrics not only for your apps, but also for your Google Cloud product usage. You can get detailed usage metrics for Cloud Storage, BigQuery, and Pub/Sub by viewing them in the Metrics Explorer.

Cloud Storage

By using the Metrics Explorer in Monitoring, you can view the storage size for your Cloud Storage buckets. Use the following values in the Metrics Explorer to view the storage size of your Cloud Storage buckets used for the logging sinks.

To view the metrics for a monitored resource by using the Metrics Explorer, do the following:

  1. In the navigation panel of the Google Cloud console, select Monitoring, and then select Metrics explorer:

    Go to Metrics explorer

  2. In the Metric element, expand the Select a metric menu, enter Total bytes in the filter bar, and then use the submenus to select a specific resource type and metric:
    1. In the Active resources menu, select GCS Bucket.
    2. In the Active metric categories menu, select Storage.
    3. In the Active metrics menu, select Total bytes.
    4. Click Apply.
    The fully qualified name for this metric is storage.googleapis.com/storage/total_bytes.
  3. Configure how the data is viewed.
    • In the Filter element, click Add filter, and then select project_id. For the value, select your Google Cloud project ID.
    • In the Filter element, click Add filter, and then select bucket_name. For the value, select your Cloud Storage export bucket name.
    • In the Aggregation entry, set the first menu to Unaggregated.

    For more information about configuring a chart, see Select metrics when using Metrics Explorer.

Storage size of Cloud Storage buckets

The previous graph shows the size of the exported data in TB over time, which provides insight into the usage for the Logging export to Cloud Storage.
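You can also read the same metric programmatically. The following Python sketch uses the google-cloud-monitoring client library to read the total_bytes metric for a single bucket over the past week; the project ID and bucket name are placeholders. The same pattern works for the BigQuery and Pub/Sub metrics described in the following sections, by changing the filter.

from google.cloud import monitoring_v3
import time

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
results = client.list_time_series(
    request={
        "name": "projects/my-project",  # placeholder project ID
        "filter": (
            'metric.type = "storage.googleapis.com/storage/total_bytes" '
            'AND resource.labels.bucket_name = "my-export-bucket"'
        ),
        "interval": monitoring_v3.TimeInterval(
            {"start_time": {"seconds": now - 7 * 24 * 3600},
             "end_time": {"seconds": now}}
        ),
    }
)
for series in results:
    # Points are returned newest first.
    print(series.resource.labels["bucket_name"], series.points[0].value.double_value)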

BigQuery

By using the Metrics Explorer in Monitoring, you can view the storage size for your BigQuery dataset. Use the following values in the Metrics Explorer to view the storage size of your BigQuery dataset used for the logging sink.

To view the metrics for a monitored resource by using the Metrics Explorer, do the following:

  1. In the navigation panel of the Google Cloud console, select Monitoring, and then select Metrics explorer:

    Go to Metrics explorer

  2. In the Metric element, expand the Select a metric menu, enter Stored bytes in the filter bar, and then use the submenus to select a specific resource type and metric:
    1. In the Active resources menu, select BigQuery dataset.
    2. In the Active metric categories menu, select Storage.
    3. In the Active metrics menu, select Stored bytes.
    4. Click Apply.
    The fully qualified name for this metric is bigquery.googleapis.com/storage/stored_bytes.
  3. Configure how the data is viewed.
    • In the Filter element, click Add filter, and then select project_id. For the value, select your Google Cloud project ID.
    • In the Filter element, click Add filter, and then select dataset_id. For the value, select your BigQuery export dataset name.
    • In the Aggregation entry, set the first menu to Mean and the second menu to dataset_id.
    • In the Display pane, set the Widget type to Stacked bar chart.
    • Set the time window to be at least one day.

    For more information about configuring a chart, see Select metrics when using Metrics Explorer.

Storage size of BigQuery dataset

The previous graph shows the size of the export dataset in TB over time, which provides insight into the usage for the Logging export to BigQuery.

Pub/Sub

By using the Metrics Explorer in Monitoring, you can view the number of messages and the sizes of the messages exported to Pub/Sub. Use the following values in the Metrics Explorer to view the byte cost of the Pub/Sub topic used for the logging sink.

To view the metrics for a monitored resource by using the Metrics Explorer, do the following:

  1. In the navigation panel of the Google Cloud console, select Monitoring, and then select Metrics explorer:

    Go to Metrics explorer

  2. In the Metric element, expand the Select a metric menu, enter Byte cost in the filter bar, and then use the submenus to select a specific resource type and metric:
    1. In the Active resources menu, select Cloud Pub/Sub topic.
    2. In the Active metric categories menu, select Topic.
    3. In the Active metrics menu, select Topic byte cost.
    4. Click Apply.
    The fully qualified name for this metric is pubsub.googleapis.com/topic/byte_cost.
  3. Configure how the data is viewed.
    • In the Filter element, click Add filter, and then select project_id. For the value, select your Google Cloud project ID.
    • In the Filter element, click Add filter, and then select topic_id. For the value, select your Pub/Sub export topic name.
    • In the Aggregation entry, set the first menu to Unaggregated.

    For more information about configuring a chart, see Select metrics when using Metrics Explorer.

Byte cost of the Pub/Sub topic

The previous graph shows the size of the data in KB exported over time, which provides insight into the usage for the Logging export to Pub/Sub.

Implementing cost controls

The following options describe potential ways to reduce your costs. Each option comes at the expense of limiting insight into your apps and infrastructure. Choose the option that provides you with the best trade-off between observability and cost.

Logging cost controls

To optimize your Logging usage, you can reduce the number of logs that are ingested into Logging. There are several strategies that you can use to help reduce log volume while continuing to maintain the logs that your developers and operators need.

Exclude logs

You can exclude from Logging and Error Reporting most of the logs that your developer and operations teams don't need.

Excluding logs means that they don't appear in the Logging or Error Reporting UI. You can use logging filters to select specific log entries or entire logs to be excluded. You can also use sampling exclusion rules to exclude a percentage of logging entries. For example, you might choose to exclude certain logs based on high volume or the lack of practical value.

Here are several common exclusion examples:

  • Exclude logs from Cloud Load Balancing. Load balancers can produce a high volume of logs for high-traffic apps. For example, you could use a logging filter to set up an exclusion for 90% of messages from Cloud Load Balancing.
  • Exclude Virtual Private Cloud (VPC) flow logs. Flow logs record each communication between virtual machines in a VPC network, which can produce a high volume of logs. There are two approaches to reduce log volume, which you might use together or separately.

    • Exclude by log entry content. Exclude most of the VPC flow logs, retaining only specific log messages that might be useful. For example, if you have private VPC networks that shouldn't receive inbound traffic from external sources, you might retain only the flow logs whose source fields contain external IP addresses.
    • Exclude by percentage. Another approach is to sample only a percentage of the logs identified by the filter. For example, you might exclude 95% and retain only 5% of the flow logs.
  • Exclude HTTP 200 OK responses from request logs. For apps, HTTP 200 OK messages might not provide much insight and can produce a high volume of logs for high-traffic apps.
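For illustration, the following sketches show exclusion filters for two of the preceding examples. They assume an external HTTP(S) load balancer, which writes logs under the http_load_balancer resource type; adjust the resource type for your configuration.

To exclude 90% of Cloud Load Balancing log entries by sampling on insertId:

resource.type="http_load_balancer" AND sample(insertId, 0.9)

To exclude request log entries with HTTP 200 OK responses:

resource.type="http_load_balancer" AND httpRequest.status=200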

Read Log exclusions to implement logging exclusions.

Export logs

You can export logs while excluding them from being ingested into Logging. This lets you retain the logs in Cloud Storage and BigQuery, or use Pub/Sub to process them, while excluding them from Logging, which might help reduce costs. Your logs don't appear in Logging, but they are exported.

Use this method to retain logs for longer-term analysis without incurring the cost of ingestion into Logging. For a detailed understanding of how exclusions and exports interact, see the life of a log diagram, which illustrates how exported log entries are treated in Logging.

Reduce Logging agent usage

You can reduce log volumes by not sending the additional logs generated by the Logging agent to Logging. The Logging agent streams logs from common third-party apps, such as Apache, MongoDB, and MySQL.

For example, you might reduce log volumes by choosing not to install the Logging agent on virtual machines in your development or other nonessential environments. Those virtual machines continue to report the standard logs to Logging, but don't report logs from third-party apps or the syslog.

Monitoring cost controls

To optimize your Monitoring usage, you can reduce the volume of chargeable metrics that are ingested into Monitoring and the number of read calls to the Monitoring API. There are several strategies that you can use to reduce metric volume while continuing to maintain the metrics that your developers and operators need.

Optimize metrics and label usage

The way that you use labels on Monitoring custom metrics can affect the volume of time series that are generated.

If you have a custom metric with two labels—for example, cost_center and env—then you can calculate the maximum number of time series by multiplying the cardinality of the two labels:

total_num_time_series = cost_center_cardinality * env_cardinality

If there are 11 cost_center values and 5 env values, that means that up to 55 time series can be generated. This is why adding additional metric labels can add significant metric volume and, therefore, increase the cost. See the Cloud Monitoring tips and tricks: Understanding metrics and building charts blog post for a detailed description of metric cardinality.

We recommend the following to minimize the number of time series:

  • Where possible, limit the number of custom metric labels.
  • Select labels thoughtfully to avoid label values with high cardinality. For example, using user_id as a label results in at least one time series for each user, which could be a very large number if you have a lot of traffic.
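The following Python sketch illustrates these recommendations by using the google-cloud-monitoring client library. The metric type, label names, and project ID are hypothetical; the point is that the label set is small and each label has few possible values.

from google.cloud import monitoring_v3
import time

client = monitoring_v3.MetricServiceClient()

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/checkout/latency"  # hypothetical metric
series.resource.type = "global"
series.resource.labels["project_id"] = "my-project"  # placeholder project ID

# Low-cardinality labels keep the number of time series, and cost, bounded.
series.metric.labels["env"] = "prod"            # a handful of values
series.metric.labels["cost_center"] = "retail"  # a handful of values
# Avoid high-cardinality labels such as user IDs or request IDs.

point = monitoring_v3.Point(
    {"interval": {"end_time": {"seconds": int(time.time())}},
     "value": {"double_value": 123.0}}
)
series.points = [point]
client.create_time_series(name="projects/my-project", time_series=[series])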

Reduce Monitoring agent usage

Metrics sent from the Monitoring agent are chargeable metrics. The Monitoring agent streams app and system metrics from common third-party apps, such as Apache, MySQL, and Nginx, as well as additional Google Cloud VM-level metrics. If you don't need the detailed system metrics or metrics from the third-party apps for certain VMs, you can reduce the volume by not sending these metrics. You can also reduce the metric volumes by reducing the number of VMs using the Monitoring agent.

For example, you can reduce metric volumes by choosing not to add Google Cloud projects in your development or other nonessential environments to Monitoring. Additionally, you can choose not to include the Monitoring agent in VMs in development or other nonessential environments.

Reduce custom metrics usage

Custom metrics are chargeable metrics that you create to monitor anything that you instrument in your apps. You can create these metrics by using the Monitoring API or by using integrations.

One such integration is OpenCensus. OpenCensus is a distribution of libraries that collect metrics and distributed traces from your apps. Apps instrumented with OpenCensus can report metrics to multiple backends, including Monitoring, by using custom metrics. These metrics appear in Monitoring under the custom.googleapis.com/opencensus prefix. For example, the client round-trip latency reported by OpenCensus appears in Monitoring as the custom.googleapis.com/opencensus/grpc.io/client/roundtrip_latency metric type.

The more apps that you instrument to send metrics, the more custom monitoring metrics are generated. If you want to reduce metric volumes, you can reduce the number of custom monitoring metrics that your apps send.
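For example, the following sketch assumes the opencensus and opencensus-ext-stackdriver packages. Each view registered with the OpenCensus view manager becomes a custom metric in Monitoring, so registering only the views that your teams use reduces the chargeable metric volume; latency_view is a hypothetical view.

from opencensus.ext.stackdriver import stats_exporter
from opencensus.stats import stats as stats_module

# Metrics flow to Monitoring only through the registered exporter.
exporter = stats_exporter.new_stats_exporter()
view_manager = stats_module.stats.view_manager
view_manager.register_exporter(exporter)

# Register only the views your developers and operators need:
# view_manager.register_view(latency_view)  # hypothetical view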

Trace cost controls

To optimize Trace usage, you can reduce the number of spans ingested and scanned. When you instrument your app to report spans to Trace, you use sampling to ingest only a portion of the traffic. Sampling is a key part of a tracing system: it preserves insight into the breakdown of latency caused by app components, such as RPC calls, without ingesting a span for every request. Not only is sampling a best practice for using Trace, but lowering the sampling rate also reduces your span volume and cost.

Use OpenCensus sampling

If you use Trace as an export destination for your OpenCensus traces, you can use the sampling feature in OpenCensus to reduce the volume of traces that are ingested.

For example, if you have a popular web app that serves 5,000 queries per second, you might gain enough insight from sampling 5% of your app traffic (250 sampled requests per second) rather than 20% (1,000 sampled requests per second). This reduces the number of spans ingested into Trace to one-fourth.

You can specify the sampling rate in the instrumentation configuration by using the OpenCensus Python libraries for Trace. For example, the OpenCensus Python library provides a ProbabilitySampler that you can use to specify a sampling rate.

from opencensus.trace.samplers import probability
from opencensus.trace import tracer as tracer_module

# Sample 5% of requests (rate=0.05).
sampler = probability.ProbabilitySampler(rate=0.05)
tracer = tracer_module.Tracer(sampler=sampler)

Use Cloud Trace API span quotas

You can use quotas to limit usage of Trace and cost. You can enforce span quotas with the API-specific quota page in the Google Cloud console.

Setting a quota that is lower than the default product quota guarantees that your project won't exceed that limit, which keeps your costs predictable. You can monitor this quota from the API-specific quota page, as illustrated in the following image.

Monitoring the API-specific quota page

If you reduce your span quota, then you should also consider monitoring the span quota metrics and setting up an alerting policy in Monitoring to send an alert when the usage is nearing the quota. This alert prompts you to look at the usage and identify the app and developer that might be generating the large volume of spans. If you set a span quota and it is exceeded, the spans are dropped until you adjust the quota.

For example, if your span quota is 50M ingested spans, you can set an alert whenever you have used 80% of your API quota, which is 40M spans. Follow the instructions in managing alerting policies to create an alerting policy by using the following details.

  1. In the Google Cloud console, go to Monitoring or use the following button:
    Go to Monitoring
  2. In the Monitoring navigation pane, select Alerting, and then select Create Policy.
  3. Enter a name for the alerting policy.
  4. Click Add Condition:
    1. The settings in the Target pane specify the resource and metric to be monitored. Click the text box to enable a menu, and then select the resource global.
    2. Click the text box to enable a menu, and then select cloudtrace.googleapis.com/billing/monthly_spans_ingested.
    3. Add the following Filter values:
      • Click project_id, and then select your Google Cloud project ID.
      • Click chargeable, and then select true.
    4. The settings in the Configuration pane of the alerting policy determine when the alert is triggered. Complete the following fields:
      • For Condition triggers if, select Any time series violates.
      • For Threshold, enter 40000000.
      • For the For field, select most recent value.
    5. Click Save.
  5. Click Save.

    The alert generated from the alerting policy is similar to the following alert. In the alert, you can see the details about the project, the alerting policy that generated the alert, and the current value of the metric.

Alert details
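If you prefer to create this alerting policy programmatically, the following Python sketch uses the google-cloud-monitoring client library to express the same condition. The project ID is a placeholder, and the exact filter syntax for the chargeable label is an assumption to verify against the metric's documentation.

from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()

# Alert when chargeable ingested spans exceed 40M (80% of a 50M quota).
condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Trace spans near quota",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'metric.type = "cloudtrace.googleapis.com/billing/monthly_spans_ingested" '
            "AND metric.labels.chargeable = true"  # assumed label filter syntax
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=40_000_000,
    ),
)
policy = monitoring_v3.AlertPolicy(
    display_name="Trace span quota at 80%",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
)
created = client.create_alert_policy(
    name="projects/my-project",  # placeholder project ID
    alert_policy=policy,
)
print(created.name)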

Optimize third-party callers

Your app might be called by another app. If your app reports spans, then the number of spans reported by your app might depend on the incoming traffic that you receive from the third-party app. For example, if you have a frontend microservice that calls a checkout microservice and both are instrumented with OpenCensus, the sampling rate for the traffic is at least as high as the frontend sampling rate. Understanding how instrumented apps interact lets you assess the impact of the number of spans ingested.

Logging export

If your costs related to the Logging exports are a concern, one solution is to update your logging sink to use a logging filter to reduce the volume of logs that are exported. You can exclude logs from the export that you don't need.

For example, if you have an environment with an app running on Compute Engine and using Cloud SQL, Cloud Storage, and BigQuery, you can limit the resulting logs to only include the information for those products. The following filter limits the export to logs for Cloud Audit Logs, Compute Engine, Cloud Storage, Cloud SQL, and BigQuery. You can use this filter for a logging sink and only include the selected logs.

logName:"/logs/cloudaudit.googleapis.com" AND
(resource.type:gce OR
resource.type=gcs_bucket OR
resource.type=cloudsql_database OR
resource.type=bigquery_resource)
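To apply a filter like this to an existing sink, you can update the sink's filter, for example with the gcloud logging sinks update SINK_NAME --log-filter=FILTER command; check the gcloud reference for the current flag names.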

Conclusion

Google Cloud Observability provides the ability to view product usage data so that you can understand the details of your product usage. This usage data lets you configure the products so that you can appropriately optimize your usage and costs.
