The metrics these APIs provide are the "golden signals" that Google's own Site Reliability engineers use to assess the health of a service. Those metrics are the overall traffic, the error rate, and the latency (at various percentiles). The metrics you see are specific to your project's use of the API and do not necessarily reflect usage or performance for any other users.
- The API Dashboard gives you a basic view of your usage, with pre-built charts on each API detail page. Any metrics used on this page are also published to Stackdriver.
- Stackdriver lets you configure robust monitoring for your application. You can add these metrics to custom dashboards, cross-reference them with other available stats (including custom metrics sent by your application, and GCP-supplied service-specific metrics like queue information for Pub/Sub or query data for Spanner), and set up alerts to warn you about unusual application behavior.
You can find a detailed reference for API metrics in the Stackdriver Monitoring documentation.
Using the API Dashboard
The simplest way to get a quick view of API metrics is to use the Cloud Platform Console's API Dashboard. You can see an overview of all your API usage metrics, or you can drill down to your usage of a specific API.
To see an overview of your usage metrics in the console, go to your Google Cloud project's APIs and Services section. The main API Dashboard is displayed by default. On this page you can see all the APIs you currently have enabled for your project, as well as overview charts for the following metrics:
- Traffic: the number of requests per second made by your project to all your enabled APIs
- Errors: the percentage of your requests to your enabled APIs that have resulted in errors
If you have APIs enabled that support latency metrics, you'll also see the following:
- Median latency: the median latency for your requests
To view usage details for a specific API:
- Select the API you want to view in the main API Dashboard list of APIs. The API's Overview page shows a more detailed traffic chart with a breakdown by response code.
For even more detailed usage information, select View metrics. By default, the following pre-built charts are displayed, though more are available:
- Traffic by response code
- Errors by API method
If the API supports latency metrics you'll also see the following:
- Overall latency at the 50th, 95th, and 99th percentile
- Latency by API method (median)
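To make the percentile charts above concrete, here is a minimal sketch of what the 50th, 95th, and 99th percentiles mean for a set of request latencies. It is an illustration only: the sample values and the `latency_percentiles` helper are hypothetical, and it is not a description of how the dashboard computes its charts.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return the 50th, 95th, and 99th percentile of a list of latencies."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points; cut point k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Hypothetical latencies in milliseconds: nine typical calls and one slow outlier.
samples = [20, 22, 25, 24, 21, 23, 26, 30, 28, 400]
result = latency_percentiles(samples)
print(result)
```

Note how a single slow request barely moves the median but dominates the 99th percentile, which is why the charts show several percentiles rather than one average.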
If you want to add to the displayed charts, you can select additional pre-built charts from the Select Graphs drop-down menu.
If you use Stackdriver, you can dive deeper into available metrics data using the Metrics Explorer to give you greater insight into your API usage. Stackdriver supports a wide variety of metrics, which you can combine with filters and aggregations for new and insightful views into your application performance. For example, you can combine a request count metric with a filter on the HTTP Response Code class to build a dashboard that shows error rates over time, or you can look at the 95th percentile latency of requests to the Cloud Pub/Sub API.
To see API metrics in Metrics Explorer, select Consumed API as the resource type, then use the filter and aggregation options to refine your data. Once you've found the API usage information you want, you can use Stackdriver to create custom dashboards and alerts that will help you continue to monitor and maintain a robust application.
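The same kind of selection can also be written as a Monitoring filter string. The sketch below only builds such a string and does not call any API. It assumes the metric type `serviceruntime.googleapis.com/api/request_count`, the `consumed_api` resource type, and the `service` and `response_code_class` labels; check these names against the API metrics reference before relying on them. The `consumed_api_filter` helper itself is hypothetical.

```python
def consumed_api_filter(service, response_code_class=None):
    """Build a Monitoring filter string for request counts of one API.

    The metric and label names used here are assumptions; verify them
    against the API metrics reference.
    """
    parts = [
        'resource.type = "consumed_api"',
        'metric.type = "serviceruntime.googleapis.com/api/request_count"',
        f'resource.labels.service = "{service}"',
    ]
    if response_code_class is not None:
        # Narrow the request count to one response code class, such as "5xx".
        parts.append(
            f'metric.labels.response_code_class = "{response_code_class}"'
        )
    return " AND ".join(parts)


# All requests to the Cloud Pub/Sub API, and only those that returned 5xx.
all_requests = consumed_api_filter("pubsub.googleapis.com")
error_filter = consumed_api_filter("pubsub.googleapis.com", "5xx")
print(error_filter)
```

Dividing the series selected by `error_filter` by the series selected by `all_requests` gives an error rate over time, which is the kind of chart described above.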
For more information, see Metrics Explorer.
Troubleshooting with API metrics
API metrics can be particularly useful if you need to contact Google when something goes wrong, and may even show you that you don't need to contact support at all. For example:
- If all of your calls to a service are failing for a single credential ID, but not any other, chances are there is something wrong with that account that you can easily fix yourself without opening a ticket.
- If you're troubleshooting a problem with your app and notice a correlation between your application's degraded performance and a sustained increase in the 50th percentile latency of a critical GCP service, contact us and point us to this data so we can start working on the problem as quickly as possible.
- If the latencies that a GCP service reports look good and unchanged from before, but your in-app metrics report that the latency on calls to the service is abnormally high, there is probably some trouble in the network. Call your network provider (in some cases, Google) to get the debugging process started.
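The three scenarios above amount to a small decision procedure. The following sketch encodes it under simplifying assumptions: the inputs are already reduced to counts and yes/no signals, and the `triage_hint` function and its return strings are hypothetical, not part of any Google tool.

```python
def triage_hint(failing_credentials, total_credentials,
                service_latency_elevated, app_latency_elevated):
    """Suggest a first step based on the three scenarios described above."""
    # Failures confined to one credential among several point at that account.
    if failing_credentials == 1 and total_credentials > 1:
        return "check the failing credential"
    # The service's own metrics confirm the slowdown your app observes.
    if service_latency_elevated and app_latency_elevated:
        return "contact Google support with the latency data"
    # The app sees slow calls, but the service reports normal latency.
    if app_latency_elevated and not service_latency_elevated:
        return "investigate the network path"
    return "no clear signal from API metrics"


print(triage_hint(1, 4, False, False))
print(triage_hint(0, 4, True, True))
print(triage_hint(0, 4, False, True))
```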
While API metrics are an extremely useful tool, there are issues you need to consider to make sure they provide useful information, particularly when setting up alerts based on metric values. The following best practices will help you get the most from API metrics data.
Is latency causing a problem?
While some services are quite latency-sensitive, for others scale and reliability matter more. Some APIs, Cloud Storage or BigQuery for example, can have a couple of seconds of high latency without customers noticing. With data from API metrics, you can learn what your users need from a given service.
Look for changes from the norm
Before you decide to alert on a particular metric value, consider what actually counts as unusual behavior. Looking at your API metrics can show you that latency results for most services fall within a normal distribution: a big hump in the middle, and outliers on either side. The metrics will help you understand the normal distribution so that you can engineer your app to work well within the distribution curve. Metrics can also help you correlate distribution changes with times when your app is not working as intended, to help you find the root cause of an issue. We expect the 99th percentile to look very different from the median; what we don’t expect are dramatic changes in those percentiles over time.
You may also see that some kinds of requests take longer than others. If the median size of a photo uploaded to Google Photos is 4 MB, but you normally upload 20 MB RAW files, your average time to upload 20 photos is likely to be substantially worse than that of most users, but is still your normal behavior.
All this means that it's not particularly useful to alert the first time a second-long RPC or a 5xx HTTP response is detected. Instead, when investigating a Google service as a possible cause for an issue your application is experiencing, compare the return codes and latencies over time and look for sustained changes from the norm that are correlated with observed issues in your application.
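One way to express "sustained change from the norm" is to require several consecutive measurements above your usual baseline before treating it as a signal. The sketch below is a minimal illustration of that idea; the `sustained_breach` helper, the baseline, the 50% tolerance, and the sample series are all hypothetical values chosen for the example.

```python
def sustained_breach(values, baseline, tolerance, min_consecutive):
    """Return True if values stay above baseline * (1 + tolerance)
    for at least min_consecutive measurements in a row."""
    threshold = baseline * (1 + tolerance)
    run = 0
    for value in values:
        # Extend the current run on a breach, reset it otherwise.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical 95th percentile latencies in ms, one value per time window.
single_spike = [120, 118, 950, 121, 119, 122]
sustained_shift = [120, 118, 310, 305, 330, 298]

spike_alert = sustained_breach(single_spike, 120, 0.5, 3)
shift_alert = sustained_breach(sustained_shift, 120, 0.5, 3)
print(spike_alert, shift_alert)
```

The single spike does not trigger, while the shift that persists across several windows does, which matches the guidance to look for sustained changes rather than individual slow calls.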
API metrics are most useful where you have a high volume of traffic going to the API. If you call a service only intermittently, your API metrics won’t be statistically valid and won’t give you meaningful triage information.
For example, if you want to track the 99.5th percentile latency for a service, and you only make 100 calls an hour, watching the measurement over a two-hour period gives you 200 calls, of which only 0.5%, or one call, falls above the 99.5th percentile. That single data point won't tell you much about the normal behavior of the API or your application. Make sure the traffic rate, the percentile you are tracking, and the time window you are considering generate many data points of interest, or the monitoring data will not be helpful to you.
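The arithmetic behind this example is easy to check for your own traffic. The sketch below computes how many calls you can expect above a given percentile in a time window, and how long a window you would need to collect a chosen number of them. The function names and the target of 30 data points are hypothetical choices for illustration.

```python
def expected_tail_samples(calls_per_hour, hours, percentile):
    """Expected number of calls that fall above the given percentile."""
    total_calls = calls_per_hour * hours
    return total_calls * (100 - percentile) / 100


def hours_needed(calls_per_hour, percentile, target_samples):
    """Window length, in hours, needed to expect target_samples tail calls."""
    tail_calls_per_hour = calls_per_hour * (100 - percentile) / 100
    return target_samples / tail_calls_per_hour


# The example above: 100 calls an hour, two hours, 99.5th percentile.
tail_points = expected_tail_samples(100, 2, 99.5)
# How long to watch to expect 30 calls above the 99.5th percentile.
window_hours = hours_needed(100, 99.5, 30)
print(tail_points, window_hours)
```

At 100 calls an hour, the two-hour window yields one expected data point, and you would need to watch for 60 hours to expect 30 of them. Tracking a lower percentile or a busier API shortens that window.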
The following APIs support all our API usage metrics, including latency metrics. Other APIs provide traffic and error metrics only.