Troubleshooting the Monitoring API

This guide explains some of the issues that might arise when you use the Monitoring API v3.

The Monitoring API is one of the Cloud APIs, which share a common set of error codes. For a list of the error codes defined by the Cloud APIs and general suggestions on handling the errors, see Handling errors.

Use APIs Explorer for debugging

APIs Explorer is a widget built into the reference pages for API methods. It lets you invoke the method by filling out fields; it does not require you to write any code.

If you are having trouble with a method invocation, use the APIs Explorer (Try this API) widget on the reference page for that method to debug your problem. See APIs Explorer for more information.

General API errors

Here are some of the Monitoring API errors and messages you might see from your API calls:

  • 404 NOT_FOUND

    • "The requested URL was not found on this server": Some part of the URL is incorrect. Compare the URL against the URL for the method, shown on the method's reference page. Check for spelling errors ("project" instead of "projects") and capitalization problems ("TimeSeries" instead of "timeSeries").
  • 401 UNAUTHENTICATED with "User is not authorized to access the project (or metric)": This might be an authorization problem, but it can also mean that you misspelled a project ID or metric type name. Check your spelling and capitalization.

    If you are not using APIs Explorer, then try using it. If your API call works in APIs Explorer, then you probably do have an authorization issue in the environment you're using for your API call. Check in the API manager page to verify that the Monitoring API v3 is enabled for your project.

  • 400 INVALID_ARGUMENT with "Field filter had an invalid value": Check the spelling and formatting of your monitoring filter. For more information, see Monitoring Filters.

  • 400 INVALID_ARGUMENT with "Request was missing field interval.endTime": You see this message if the end time is missing, or if it is present but not properly formatted. If you are using APIs Explorer, do not quote the value of the time field.

    Here are some examples of correct time specifications:
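As an illustration, assuming the time fields accept RFC 3339 timestamps (for example, 2016-05-11T01:23:45Z), the following Python sketch formats a start and end time for a request. The helper name rfc3339 is hypothetical and is not part of any client library.

```python
from datetime import datetime, timedelta, timezone


def rfc3339(dt: datetime) -> str:
    """Format a timezone-aware datetime as an RFC 3339 UTC timestamp ending in 'Z'."""
    if dt.tzinfo is None:
        # Naive datetimes are ambiguous, so reject them rather than guess a zone.
        raise ValueError("datetime must be timezone-aware")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Illustrative values only: a five-minute interval ending at a fixed time.
end = datetime(2016, 5, 11, 1, 23, 45, tzinfo=timezone.utc)
start = end - timedelta(minutes=5)
interval = {"startTime": rfc3339(start), "endTime": rfc3339(end)}
print(interval)
```

Converting to UTC before formatting avoids having to emit a numeric offset, which is one less thing to get wrong when a request is built by hand.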


Missing results

If your API call returns status code 200 and an empty response, there are several possibilities:

  • If your call uses a filter, then the filter might not have matched anything. The filter match is case-sensitive. To resolve filter problems, start by specifying only one filter component, such as metric.type, and see if you get results. Add the other filter components one by one to build up your request.

  • If you are working with a custom metric, you might not have specified the project where your custom metric is defined.
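The incremental filter-debugging approach described above can be sketched as follows. This is a minimal Python example that only builds filter strings; the function name build_filters is hypothetical, and the metric type and zone label are illustrative values that you would replace with your own.

```python
def build_filters(components):
    """Return cumulative filters: the first component, then the first two, and so on."""
    return [" AND ".join(components[: i + 1]) for i in range(len(components))]


# Illustrative filter components; start with metric.type and add one at a time.
components = [
    'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
    'resource.labels.zone = "us-central1-a"',
]

for candidate in build_filters(components):
    # Try each candidate in turn; the first one that returns no data
    # points to the component that fails to match.
    print(candidate)
```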

If you are fetching time-series data by using timeSeries.list, and some of the data points seem to be missing, then check the following additional causes:

  • If the data is more than a few weeks old, it might have expired. For more information, see Data retention.

  • If the data was just written, it might not yet be in Monitoring. For more information, see Latency of metric data.

  • Check that you specified the time interval correctly:

    • Check that the end time is correct.
    • Check that the start time is correct, and earlier than the end time. If the start time is missing or malformed, it defaults to the end-time value, and the time interval will match only points whose start and end times are exactly the interval's end time. (This is valid for GAUGE metrics, which measure a point in time, but not for CUMULATIVE or DELTA metrics, which measure across time intervals.) For more information, see Time intervals.
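The interval checks above can be run locally before sending a request. The following Python sketch is one way to do that; check_interval is a hypothetical helper, and it assumes you already know the metric kind (GAUGE, DELTA, or CUMULATIVE) of the metric you are querying.

```python
from datetime import datetime, timezone


def check_interval(start, end, metric_kind):
    """Return a list of likely problems with a query interval (empty if none).

    start and end are timezone-aware datetimes or None; metric_kind is
    "GAUGE", "DELTA", or "CUMULATIVE".
    """
    if end is None:
        return ["end time is missing"]
    if start is None:
        # Per the guide, a missing start time defaults to the end time.
        start = end
    if start > end:
        return ["start time is later than end time"]
    if start == end and metric_kind in ("DELTA", "CUMULATIVE"):
        # A zero-length interval is only meaningful for point-in-time metrics.
        return ["zero-length interval; " + metric_kind + " points span a time range"]
    return []


end = datetime(2016, 5, 11, 1, 23, 45, tzinfo=timezone.utc)
print(check_interval(None, end, "GAUGE"))
print(check_interval(None, end, "DELTA"))
```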

Retrying API errors

Two of the Cloud APIs error codes indicate circumstances in which it might be useful to retry the request:

  • 503 UNAVAILABLE: retries are useful if the problem is a short-lived or transient condition.
  • 429 RESOURCE_EXHAUSTED: retries are useful, after a delay, only for long-running background jobs with time-based quota, for example, if you are limited to n calls per t seconds. But if you've exhausted a volume-based quota, retries do not help; you have to get your quota increased.
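As a rough sketch of the two rules above, the following Python function decides whether a failed call is worth retrying. The function name and the quota_is_time_based flag are assumptions made for illustration; your code has to know from its own configuration which kind of quota applies.

```python
def should_retry(status_code: int, quota_is_time_based: bool = False) -> bool:
    """Return True if the HTTP status suggests a retry might succeed."""
    if status_code == 503:
        # UNAVAILABLE: possibly a short-lived or transient condition.
        return True
    if status_code == 429:
        # RESOURCE_EXHAUSTED: waiting helps only for time-based quota.
        return quota_is_time_based
    return False


print(should_retry(503))
print(should_retry(429, quota_is_time_based=True))
print(should_retry(429))
```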

When writing code that might retry requests, first ensure that the request is safe to retry.

Is the request safe to retry?

If your request is idempotent, then it is safe to retry. An idempotent action is one where any change in state does not depend on the current state. For example:

  • Reading x is idempotent; there is no change to the value.
  • Setting x to 10 is idempotent; this might change the state, if the value isn't already 10, but it doesn't matter what the current value is. And it doesn't matter how many times you attempt to set the value.
  • Incrementing x is not idempotent; the new value depends on the current value.
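The three cases above can be checked with a small Python sketch. It tests a common practical property of idempotent operations: applying the operation twice leaves the same state as applying it once. All names here are hypothetical and exist only for illustration.

```python
def read_x(state):
    """Reading does not change the state."""
    return dict(state)


def set_to_ten(state):
    """Setting x to 10 gives the same result regardless of the current value."""
    return {**state, "x": 10}


def increment(state):
    """Incrementing depends on the current value of x."""
    return {**state, "x": state["x"] + 1}


def is_idempotent_on(operation, state):
    """Return True if applying the operation twice equals applying it once."""
    once = operation(state)
    twice = operation(once)
    return once == twice


initial = {"x": 3}
for operation in (read_x, set_to_ten, increment):
    print(operation.__name__, is_idempotent_on(operation, initial))
```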

Retry with exponential backoff

When implementing code to retry requests, you don't want to rapidly issue new requests indefinitely. If a system is overloaded, this approach contributes to the problem.

Instead, use a truncated exponential backoff approach. When requests fail because of transient overloads rather than true unavailability, the solution is to reduce the load. A truncated exponential backoff follows this general pattern:

  • Establish how long you are willing to wait while retrying or how many attempts you are willing to make. When this limit is exceeded, consider the service unavailable and handle that condition appropriately for your application. This is what makes the backoff truncated; you stop retrying at some point.

  • Retry the request with increasingly long pauses to back off the frequency of retries. Retry until the request succeeds or your established limit is reached.

    The interval is typically increased by some function of the power of the retry count, making it an exponential backoff.

There are many ways to implement an exponential backoff. The following is a simple example that adds an increasing backoff delay to a minimum delay of 1000ms. The backoff delay is 2^retry_count ms: it starts at 1ms for retry count 0 and doubles with each attempt.

The following table shows the retry intervals using the initial values:

  • Minimum delay = 1s = 1000ms
  • Backoff base = 2 (additional delay = 2^retry_count ms)

Retry count   Additional delay (ms)   Retry after (ms)
0             2^0 = 1                 1001
1             2^1 = 2                 1002
2             2^2 = 4                 1004
3             2^3 = 8                 1008
4             2^4 = 16                1016
...           ...                     ...
n             2^n                     1000 + 2^n

You can truncate the retry cycle by stopping either after n attempts or when the time spent exceeds a reasonable value for your application.
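The pattern above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: TransientError, retry_delay_ms, and call_with_backoff are hypothetical names, the request is any callable that raises TransientError on a retryable failure, and the retry cycle is truncated by a maximum number of attempts.

```python
import time

MIN_DELAY_MS = 1000  # minimum delay from the example above


class TransientError(Exception):
    """Hypothetical error type standing in for a retryable failure."""


def retry_delay_ms(retry_count: int) -> int:
    """Return the wait before the next retry: 1000 + 2^retry_count ms."""
    return MIN_DELAY_MS + 2 ** retry_count


def call_with_backoff(request, max_attempts=5, sleep=time.sleep):
    """Call request(), retrying on TransientError with exponential backoff.

    Gives up (re-raises) after max_attempts attempts, which is what makes
    the backoff truncated. Only use this with requests that are safe to retry.
    """
    for retry_count in range(max_attempts):
        try:
            return request()
        except TransientError:
            if retry_count == max_attempts - 1:
                raise  # limit reached: treat the service as unavailable
            sleep(retry_delay_ms(retry_count) / 1000.0)


# The first five retry intervals, in milliseconds.
print([retry_delay_ms(n) for n in range(5)])
```

Passing the sleep function as a parameter keeps the retry loop easy to test without real delays. Production code often also adds random jitter to each delay so that many clients do not retry in lockstep; that is omitted here to match the example above.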

For more information, see the Wikipedia article Exponential backoff.