Stackdriver cost optimization

Stackdriver Logging, Stackdriver Monitoring, and Stackdriver APM are cloud-based managed services designed to provide deep observability into app and infrastructure services. One of the benefits of managed services on Google Cloud Platform (GCP) is that services are usage-based, which means you pay only for what you use. While this pricing model might provide a cost benefit when compared to standard software licensing, it can make costs harder to forecast. This solution describes the ways that you can understand your Stackdriver usage and optimize your costs.

About Stackdriver pricing

Each of the services in the Stackdriver suite is a managed service. This lets you focus on using the insights provided by Logging, Monitoring, and APM rather than on the infrastructure required to run these services. When you use the Stackdriver services, you don't have to pay separately for virtual machines, software licenses, security scanning, hardware maintenance, or space in a data center. The Stackdriver managed services provide a simple per-usage cost.

Stackdriver costs include charges for Logging, Monitoring, and APM, which include Stackdriver Trace, Stackdriver Profiler, Stackdriver Error Reporting, and Stackdriver Debugger. Profiler and Error Reporting don't have a cost while still in beta, and you can use Debugger at no cost.

Logging costs

Logging prices are based on the volume of chargeable logs ingested, with a simple per-GiB cost. There is a free allotment per month, and certain logs, such as Cloud Audit Logging, are non-chargeable.

Example product usage that generates cost through additional log volume includes using the following (a sketch of the Logging API write case follows the list):

  • Cloud Load Balancing
  • The Logging agent on Compute Engine
  • The Logging agent on Amazon Web Services (AWS)
  • The write operation in the Stackdriver Logging API
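
The following minimal Python sketch writes one log entry through the write operation in the Logging API, using the google-cloud-logging client library. The ingested entry counts toward chargeable log volume; the project ID and log name are placeholders.

from google.cloud import logging

# "my-project-id" and "my-app-log" are placeholder values.
client = logging.Client(project="my-project-id")
logger = client.logger("my-app-log")

# Each call results in an entries.write API call, and the ingested entry
# counts toward the chargeable log volume.
logger.log_text("Checkout completed", severity="INFO")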

Monitoring costs

Monitoring prices are based on the volume of chargeable metrics ingested and the number of chargeable API calls. Non-GCP metrics, such as agent, custom, external, and AWS metrics, are chargeable. The projects.timeSeries.list method in the Stackdriver Monitoring API is charged by API call, while the remainder of the API usage is free. There is a free metric-volume allotment per month, and many of the metrics, including all of the GCP metrics, are non-chargeable. See Stackdriver pricing for more information about which metrics are chargeable.

Example product usage that generates cost through metric volume and API calls includes using the following (a sketch of the chargeable read call follows the list):

  • Monitoring custom metrics
  • The Monitoring agent on Compute Engine
  • The Monitoring agent on AWS
  • The read operation in the Monitoring API
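
As an illustration of the chargeable read path, the following sketch calls projects.timeSeries.list through a recent (v2+) google-cloud-monitoring Python client. The project ID and metric type are placeholders, and application default credentials are assumed.

import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()

# Query the last hour of CPU utilization. Each list_time_series call
# invokes projects.timeSeries.list, the chargeable read operation.
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)
results = client.list_time_series(
    request={
        "name": "projects/my-project-id",
        "filter": 'metric.type="compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
    }
)
for series in results:
    print(series.resource.labels, len(series.points))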

Trace costs

Trace prices are based on the number of spans ingested and scanned. Some GCP services, such as the App Engine standard environment, automatically produce non-chargeable spans. There is a free allotment per month for Trace.

Example product usage that generates cost through spans ingested includes adding instrumentation for:

  • App Engine apps, beyond the default spans
  • Cloud Load Balancing
  • Custom apps

About Stackdriver usage

Understanding your Stackdriver usage can provide insight into which components of the Stackdriver suite generate cost. This helps you identify areas that might be appropriate for optimization.

GCP bill analysis

The first step to understanding your usage is to review your GCP bill and understand your Stackdriver-related costs. One way to gain insight is to use the Billing Reports available in the Google Cloud Platform Console.

The Reports page offers a useful range of filters to narrow the results by time, project, products, SKUs, and labels. To narrow the billing data to the Stackdriver-related costs specifically, you can add a products filter selecting all the Stackdriver products and then group by product. The resulting graph, illustrated in the following image, provides your cost breakdown across the Stackdriver products.

Cost breakdown by Stackdriver

Logging

Logging provides detailed lists of logs, current log volume, and projected monthly volume for each project. You can review these details for each project while reviewing your Logging charges on your GCP bill. This makes it simple to view which logs have the highest volume and, therefore, contribute the most to the cost.

You can find the volume of logs ingested in your project in the Logs Ingestion section of Logging. The Logs Viewer provides a list of the logs with their previous month, current month, excluded, and projected end-of-month (EOM) log volumes. Each log links to Monitoring, which displays a graph of the log's volume over time, as shown in the following image.

Logs viewer with log links

This analysis lets you develop insight into the usage for the logs in specific projects and how their volume has changed over time. You can use this analysis to identify which logs you should consider optimizing.

Monitoring

Monitoring, organized into Workspaces, provides a detailed list of projects and previous, current, and projected metric volumes. Because Stackdriver Workspaces might include more than one project, the volumes for each project are listed separately, as illustrated in the following image.

Workspace with multiple projects

Learn how to find the Workspace Monitoring usage details.

You can view a detailed graph of the metric-volume metric for each project in the Metrics Explorer in Monitoring, which provides insight into the volume of metrics ingested over time.

This analysis provides you with the Monitoring metric volumes for each project that you identified while reviewing your Monitoring charges on your GCP bill. You can then review the specific metric volumes and understand which components are contributing the most volume and cost.

Trace

Trace provides a detailed view of the spans ingested for the current and previous month. You can review these details in the GCP Console for each project that you identify while reviewing your Trace charges on your GCP bill.

This analysis provides you with the number of spans ingested for each project in a Stackdriver Workspace. You can then review the specific numbers and understand which projects and services contribute the most spans and, therefore, cost.

Logging export

Logging provides sinks to export logs to Cloud Storage, BigQuery, and Cloud Pub/Sub. Costs related to these exports aren't included in the standard Stackdriver pricing; they are reflected in the costs for the respective products.

For example, if you export all of your Stackdriver logs to BigQuery for long-term storage and analysis, you incur the BigQuery costs, including per-GiB storage and any query costs.

To understand the costs your exports are generating, consider the following steps:

  1. Find your logging sinks. Find out which logging sinks, if any, you have enabled. For example, your project might already have several logging sinks that were created for different purposes, such as security operations or meeting regulatory requirements.
  2. Review your usage details. Review usage for the destination of the exports. For example, BigQuery table sizes for a BigQuery export or the bucket sizes for Cloud Storage exports.

Find your logging sinks

Your logging sinks might be at the project level (one or more sinks per project), or they might be at the GCP organization level, called aggregated exports, which can include the logs of many projects in the same GCP organization.

You can view your logging sinks by looking at specific projects. The GCP Console provides a list of the sinks and their destinations. To view the aggregated exports for your organization, you can use the gcloud logging sinks list --organization ORGANIZATION_ID command.
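
For project-level sinks, the following sketch lists each sink with its destination and filter by using the google-cloud-logging Python client; the project ID is a placeholder, and organization-level aggregated exports still require the gcloud command above.

from google.cloud import logging

# "my-project-id" is a placeholder; application default credentials are assumed.
client = logging.Client(project="my-project-id")

# Print each project-level sink with its destination and filter.
for sink in client.list_sinks():
    print(sink.name, sink.destination, sink.filter_)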

Review your usage details

Monitoring provides a rich set of metrics not only for your apps, but also for your GCP product usage. You can get the detailed usage metrics for Cloud Storage, BigQuery, and Cloud Pub/Sub by viewing the usage metrics in Monitoring Metrics Explorer.

Cloud Storage

By using the Metrics Explorer in Monitoring, you can view the storage size for your Cloud Storage buckets. Use the following values in the Metrics Explorer to view the storage size of your Cloud Storage buckets used for the logging sinks.

In the GCP Console, go to the Metrics Explorer page.

  1. For the Resource type, enter gcs_bucket.
  2. For the Metric, enter storage.googleapis.com/storage/total_bytes.
  3. Add the following Filters:

    1. Click project_id, and then select your GCP project ID.
    2. Click bucket_name, and then select your Cloud Storage export bucket name.

Storage size of Cloud Storage buckets

The previous graph shows the size of the data in KB exported over time, which provides insight into the usage for the Logging export to Cloud Storage.
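
If you prefer to query the same data programmatically, the following sketch reads the storage/total_bytes metric for one bucket over the past 30 days by using a recent (v2+) google-cloud-monitoring client; the project ID and bucket name are placeholders. Swapping the metric type and resource label adapts it to the BigQuery (stored_bytes, dataset_id) and Cloud Pub/Sub (byte_cost, topic_id) sections that follow.

import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()

# Look back 30 days; "my-log-export-bucket" is a placeholder bucket name.
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 30 * 24 * 3600}}
)
results = client.list_time_series(
    request={
        "name": "projects/my-project-id",
        "filter": (
            'metric.type="storage.googleapis.com/storage/total_bytes" AND '
            'resource.labels.bucket_name="my-log-export-bucket"'
        ),
        "interval": interval,
    }
)
for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value.double_value)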

BigQuery

By using the Metrics Explorer in Monitoring, you can view the storage size for your BigQuery dataset. Use the following values in the Metrics Explorer to view the storage size of your BigQuery dataset used for the logging sink.

In the GCP Console, go to the Metrics Explorer page.

  1. For the Resource type, enter bigquery_dataset.
  2. For the Metric, enter bigquery.googleapis.com/storage/stored_bytes.
  3. Add the following Filters:

    1. Click project_id, and then select your GCP project ID.
    2. Click dataset_id, and then select your BigQuery export dataset name.
  4. For Group By, enter dataset_id.

Storage size of BigQuery dataset

The previous graph shows the size of the export dataset in GB over time, which provides insight into the usage for the Logging export to BigQuery.

Cloud Pub/Sub

By using the Metrics Explorer in Monitoring, you can view the number and size of the messages exported to Cloud Pub/Sub. Use the following values in the Metrics Explorer to view the volume of data published to the Cloud Pub/Sub topic used for the logging sink.

In the GCP Console, go to the Metrics Explorer page.

  1. For the Resource type, enter pubsub_topic.
  2. For the Metric, enter pubsub.googleapis.com/topic/byte_cost.
  3. Add the following Filters:

    1. Click project_id, and then select your GCP project ID.
    2. Click topic_id, and then select your Cloud Pub/Sub export topic name.

Storage size of Cloud Pub/Sub topic

The previous graph shows the size of the data in KB exported over time, which provides insight into the usage for the Logging export to Cloud Pub/Sub.

Implementing Stackdriver cost controls

The following options describe potential ways to reduce your Stackdriver costs. Each option comes at the expense of limiting insight into your apps and infrastructure. Choose the option that provides you with the best trade-off between observability and cost.

Logging cost controls

To optimize your Logging usage, you can reduce the number of logs that are ingested into Logging. There are several strategies that you can use to help reduce log volume while continuing to maintain the logs that your developers and operators need.

Exclude logs

You can exclude most logs that your developer and operations teams don't need from Logging or Error Reporting.

Excluding logs means that they don't appear in the Logging or Error Reporting UI. You can use logging filters to select specific log entries or entire logs to exclude. You can also use sampling exclusion rules to exclude a percentage of log entries. For example, you might choose to exclude certain logs because of their high volume or their lack of practical value.

Here are several common exclusion examples:

  • Exclude logs from Cloud Load Balancing. Load balancers can produce a high volume of logs for high-traffic apps. For example, you could use a logging filter to set up an exclusion for 90% of messages from Cloud Load Balancing.
  • Exclude Virtual Private Cloud (VPC) Flow Logs. VPC Flow Logs record each communication between virtual machines in a VPC network, which can produce a high volume of logs. There are two approaches to reduce log volume, which you might use together or separately.

    • Exclude by log entry content. Exclude most of the VPC flow logs, retaining only specific log messages that might be useful. For example, if you have private VPCs that shouldn't receive inbound traffic from external sources, you might want to retain only flow logs whose source fields contain external IP addresses.
    • Exclude by percentage. Another approach is to sample only a percentage of logs identified by the filter. For example, you might exclude 95% and retain only 5% of the flow logs.
  • Exclude HTTP 200 OK responses from request logs. For apps, HTTP 200 OK messages might not provide much insight and can produce a high volume of logs for high-traffic apps.

Read Log exclusions to implement logging exclusions.
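
As an illustration, the following sketch creates the 90% Cloud Load Balancing exclusion described above by using the Logging v2 config API client. The exclusion name and project ID are placeholders; the sample() function selects 90% of matching entries so that the exclusion drops them before ingestion.

from google.cloud import logging_v2

client = logging_v2.ConfigServiceV2Client()

# "exclude-lb-sample" and "my-project-id" are placeholders. The filter
# matches 90% of load balancer request logs; matched entries are dropped.
exclusion = logging_v2.types.LogExclusion(
    name="exclude-lb-sample",
    description="Drop 90% of Cloud Load Balancing request logs",
    filter='resource.type="http_load_balancer" AND sample(insertId, 0.9)',
)
client.create_exclusion(parent="projects/my-project-id", exclusion=exclusion)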

Export logs

You can export logs and still exclude them from being ingested into Logging. This lets you retain the logs in Cloud Storage or BigQuery, or process them with Cloud Pub/Sub, while excluding them from Logging, which might help reduce costs. The logs don't appear in Logging, but they are exported to the sink destination.

Use this method to retain logs for longer-term analysis without incurring the cost of ingestion into Logging. For a detailed understanding of how exclusions and exports interact, see the life of a log diagram, which illustrates how exported log entries are treated in Logging.

Follow the instructions in the design patterns for exporting from Logging to implement exports.
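
As a minimal sketch, the following example creates a sink that exports Compute Engine logs to BigQuery by using the google-cloud-logging client. The sink name, project ID, and dataset are placeholders; the dataset must already exist, and the sink's writer identity needs access to it.

from google.cloud import logging

client = logging.Client(project="my-project-id")

# "bq-export-sink" and the dataset path are placeholders.
sink = client.sink(
    "bq-export-sink",
    filter_='resource.type="gce_instance"',
    destination="bigquery.googleapis.com/projects/my-project-id/datasets/my_log_dataset",
)
sink.create()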

Reduce Logging agent usage

You can reduce log volumes by not sending the additional logs generated by the Logging agent to Logging. The Logging agent streams logs from common third-party apps, such as Apache, MongoDB, and MySQL.

For example, you might reduce log volumes by choosing not to install the Logging agent on virtual machines in your development or other nonessential environments. Those virtual machines continue to report the standard logs to Logging, but they don't report logs from third-party apps or syslog.

Monitoring cost controls

To optimize your Monitoring usage, you can reduce the volume of chargeable metrics that are ingested into Monitoring and the number of read calls to the Monitoring API. There are several strategies that you can use to reduce metric volume while continuing to maintain the metrics that your developers and operators need.

Optimize Stackdriver metrics and label usage

The way that you use labels for GCP components might impact the volume of time series that are generated for your metrics in Monitoring.

For example, you can use labels on your VMs to appropriately report metrics to cost centers on your GCP bill and to signify whether specific GCP environments are production or development, as illustrated in the following image.

Labels on Google Kubernetes Engine clusters

Adding these labels means that additional time series are generated in Monitoring. If you label your virtual machines with cost_center and env values, then you can calculate the total number of time series by multiplying the cardinalities of the two labels:

total_num_time_series = cost_center_cardinality * env_cardinality

If there are 11 cost_center values and 5 env values, that means 55 time series are generated. This is why the way that you add labels might add significant metric volume and, therefore, increase the cost. See the Stackdriver tips and tricks: Understanding metrics and building charts blog post for a detailed description of metric cardinality.

We recommend the following to minimize the additional time series:

  1. Where possible, limit the number of labels.
  2. Select label values thoughtfully to avoid label values with high cardinality. For example, using an IP address as a label results in one time series for each IP address, which could be a large number if you have many VMs.

Reduce Monitoring agent usage

Metrics sent from the Monitoring agent are chargeable metrics. The Monitoring agent streams app and system metrics from common third-party apps, such as Apache, MySQL, and Nginx, as well as additional GCP VM-level metrics. If you don't need the detailed system metrics or metrics from the third-party apps for certain VMs, you can reduce the volume by not sending these metrics. You can also reduce the metric volumes by reducing the number of VMs using the Monitoring agent.

For example, you can reduce metric volumes by choosing not to add GCP projects in your development or other nonessential environments to Monitoring. Additionally, you can choose not to include the Monitoring agent in VMs in development or other nonessential environments.

Reduce custom metrics usage

Custom metrics are chargeable metrics that monitor anything a user instruments. You can create these metrics by using the Monitoring API or by using tools that are integrated with Stackdriver.

One such tool is OpenCensus. OpenCensus is a distribution of libraries that collect metrics and distributed traces from your apps. Apps instrumented with OpenCensus can report metrics to multiple backends, including Stackdriver, by using custom metrics. These metrics appear in Monitoring under metric types with the custom.googleapis.com/opencensus prefix. For example, the client round-trip latency reported by OpenCensus appears in Monitoring as the custom.googleapis.com/opencensus/grpc.io/client/roundtrip_latency metric type.

The more apps that you instrument to send metrics, the more custom monitoring metrics are generated. If you want to reduce metric volumes, you can reduce the number of custom monitoring metrics that your apps send.
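
To make the cost driver concrete, the following sketch writes one point to a hypothetical custom metric through the Monitoring API, using a recent (v2+) google-cloud-monitoring client. The metric type, project ID, and value are placeholders; every distinct combination of metric and label values creates another chargeable time series.

import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()

# "custom.googleapis.com/checkout/cart_size" and "my-project-id" are
# placeholders; each new label combination adds a chargeable time series.
series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/checkout/cart_size"
series.resource.type = "global"
series.resource.labels["project_id"] = "my-project-id"
point = monitoring_v3.Point(
    {
        "interval": {"end_time": {"seconds": int(time.time())}},
        "value": {"double_value": 3.0},
    }
)
series.points = [point]
client.create_time_series(name="projects/my-project-id", time_series=[series])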

Trace cost controls

To optimize Trace usage, you can reduce the number of spans ingested and scanned. When you instrument your app to report spans to Trace, you use sampling to ingest a portion of the traffic. Sampling is a key part of a tracing system because it provides insight into the breakdown of latency caused by app components, such as RPC calls. Not only is this a best practice for using Trace, but you might reduce your span volume for cost-reduction reasons as well.

Use OpenCensus sampling

If you use Trace as an export destination for your OpenCensus traces, you can use the sampling feature in OpenCensus to reduce the volume of traces that are ingested.

For example, if you have a popular web app with 5,000 queries per second, you might gain enough insight from sampling 5% of your app traffic rather than 20%. This reduces the number of spans ingested into Trace to one-fourth of the previous volume.

You can specify the sampling rate in the instrumentation configuration by using the OpenCensus Python libraries for Trace. For example, the OpenCensus Python library provides a ProbabilitySampler that you can use to specify a sampling rate.

from opencensus.trace.samplers import probability
from opencensus.trace import tracer as tracer_module

# Sample requests at a rate of 0.05 (5%).
sampler = probability.ProbabilitySampler(rate=0.05)
tracer = tracer_module.Tracer(sampler=sampler)

Use Stackdriver Trace API span quotas

You can use quotas to limit Trace usage and cost. You can enforce span quotas with the API-specific quota page in the GCP Console.

Setting a specific quota that is lower than the default product quota guarantees that your project won't go over that quota limit. This is a way to ensure that your costs are predictable. You can monitor this quota from the API-specific quota page, as illustrated in the following image.

Monitoring the API-specific quota page

If you reduce your span quota, then you should also consider monitoring the span quota usage and setting up an alerting policy in Monitoring to send an alert when the usage is nearing the quota. This alert prompts you to look at the usage and identify the app and developer that might be generating the large volume of spans. If you set a span quota and it is exceeded, the spans are dropped until you adjust the quota.

For example, if your span quota is 50M ingested spans, you can set an alert whenever you have used 80% of your API quota, which is 40M spans. Follow the instructions in managing alerting policies to create an alerting policy by using the following details (a programmatic sketch follows the steps).

In the GCP Console, go to Monitoring.

  1. For the Resource type, enter global.
  2. For the Metric, enter cloudtrace.googleapis.com/billing/monthly_spans_ingested.
  3. Add the following Filter values:

    1. Click project_id, and then select your GCP project ID.
    2. Click chargeable, and then select true.
  4. In the Configuration section, complete the following fields:

    1. For Condition triggers if, select Any time series violates.
    2. For Condition, select is above.
    3. For Threshold, enter 40000000.
    4. In the For field, select most recent value.
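
The same alerting policy can also be created programmatically. The following hedged sketch uses a recent (v2+) google-cloud-monitoring client; the display names and project ID are placeholders, and no notification channels are attached.

from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()

# Fires when chargeable monthly ingested spans exceed 40M, which is 80%
# of a 50M span quota. Display names and project ID are placeholders.
policy = monitoring_v3.AlertPolicy(
    display_name="Trace span quota at 80%",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="Monthly spans ingested above 40M",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=(
                    'metric.type="cloudtrace.googleapis.com/billing/'
                    'monthly_spans_ingested" AND resource.type="global" '
                    'AND metric.labels.chargeable="true"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=40000000,
                duration={"seconds": 0},
            ),
        )
    ],
)
client.create_alert_policy(name="projects/my-project-id", alert_policy=policy)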

The alert generated from the alerting policy is similar to the following alert. In the alert, you can see the details about the project, the alerting policy that generated the alert, and the current value of the metric.

Alert details

Optimize third-party callers

Your app might be called by another app. If your app reports spans, then the number of spans reported by your app might depend on the incoming traffic that you receive from the third-party app. For example, if you have a frontend microservice that calls a checkout microservice and both are instrumented with OpenCensus, the sampling rate for the traffic is at least as high as the frontend sampling rate. Understanding how instrumented apps interact lets you assess the impact of the number of spans ingested.

Logging export

If your costs related to the Logging exports are a concern, one solution is to update your logging sink to use a logging filter to reduce the volume of logs that are exported. You can exclude logs from the export that you don't need.

For example, if you have an environment with an app running on Compute Engine and using Cloud SQL, Cloud Storage, and BigQuery, you can limit the resulting logs to only include the information for those products. The following filter limits the export to logs for Cloud Audit Logging, Compute Engine, Cloud Storage, Cloud SQL, and BigQuery. You can use this filter for a logging sink and only include the selected logs.

logName:"/logs/cloudaudit.googleapis.com" AND
(resource.type:gce OR
resource.type=gcs_bucket OR
resource.type=cloudsql_database OR
resource.type=bigquery_resource)
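
As a sketch, you could apply this filter to an existing sink with the google-cloud-logging Python client; the sink name and project ID are placeholders.

from google.cloud import logging

client = logging.Client(project="my-project-id")

# "bq-export-sink" is a placeholder for an existing sink in the project.
sink = client.sink("bq-export-sink")
sink.reload()

# Replace the sink's filter so that only the selected logs are exported.
sink.filter_ = (
    'logName:"/logs/cloudaudit.googleapis.com" AND '
    '(resource.type:gce OR resource.type=gcs_bucket OR '
    'resource.type=cloudsql_database OR resource.type=bigquery_resource)'
)
sink.update()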

Conclusion

Logging, Monitoring, and the APM suite let you view detailed product usage data. This data lets you configure the products so that you can appropriately optimize your usage and costs.
