Identifying causes of app latency with Cloud Monitoring and OpenCensus


This document shows app developers how to identify the causes of tail latency by using OpenCensus and Cloud Monitoring for metrics and distributed tracing.

When you observe how your app serves users, you find that some users experience higher latency than others. These differences grow as your user population becomes more diverse and resource usage increases. Identifying and addressing the sources of high latency lets you serve a larger user base at more optimal levels of resource usage. As resource usage increases, tail latency is usually the first symptom of resource contention.

Although the users experiencing these higher latencies might be small, they can also be important, for example, if they represent the leading edge of a new potential market.

This document discusses how to identify and explain the sources of high latency that these users experience. In particular, the document answers the following questions:

  • How can you measure tail latency accurately?
  • What are the main causes of app latency?
  • How can you evaluate strategies for minimizing tail latency?

The main themes that answer these questions are:

  • Identifying the causes of inaccuracy in the estimates of monitoring metrics.
  • Identifying the causes of high latency.
  • Minimizing tail latency.

Source code for the test app is provided in a GitHub repository.

This document assumes that you are familiar with basic use of OpenCensus, Cloud Monitoring, Cloud Trace, BigQuery, and app development in Java.


latency
A measure of app performance. The time it takes the app to perform a task. The latency values in this document are measured from within the app's code.
tail latency
The higher latency values experienced by a small set of users. Tail latency can be defined at percentile levels. For example, the 99th percentile is the highest 1% of latency values for requests measured over a time interval. For more information, see the "Worrying about your Tail" section in the "Monitoring Distributed Systems" chapter of Site Reliability Engineering.
aggregation model
A representation of monitoring metrics that shows the statistical distribution of the measurements collected. This document assumes a histogram model with the range of latencies divided into buckets and the counts for each bucket recorded for each time interval. Monitoring uses the term distribution metric for metrics with this kind of representation.
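To make the histogram model concrete, the following sketch aggregates raw latency values into the per-bucket counts that a distribution metric records for a time interval. The bucket boundaries here are illustrative, not the actual OpenCensus defaults.

```python
from bisect import bisect_right

# Illustrative explicit bucket boundaries, in milliseconds.
BOUNDS = [0, 1, 2, 3, 4, 5, 6, 8, 10, 13, 16, 20, 25, 30, 40, 50, 65, 80, 100, 130]

def record(latencies_ms, bounds=BOUNDS):
    """Aggregate raw latency values into per-bucket counts.

    Bucket i counts values in [bounds[i], bounds[i+1]); the last
    bucket also collects values at or above the final boundary.
    """
    counts = [0] * len(bounds)
    for v in latencies_ms:
        counts[min(bisect_right(bounds, v) - 1, len(counts) - 1)] += 1
    return counts
```

The raw values are discarded after aggregation, which is why tail percentiles can only be estimated from the bucket counts.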


Cloud app owners often monitor a combination of median and tail latency values. The median latency is a good indicator for the overall health of your app. However, as your app scales up, subtle problems can appear that aren't readily observable through the median latency.

Understanding tail latency can be useful for a number of purposes:

  • Detecting emerging problems early that aren't apparent in the median latency.
  • Making sure that high utilization isn't achieved at the cost of suboptimal handling of some requests.
  • Monitoring a fleet when a single server is unhealthy or when a code path that only a minority of users follow isn't handled well.
  • Informing strategies for minimizing tail latency, such as setting appropriate values for timeouts and backoff retry. You may adjust the timeout to slightly higher than the 99th percentile latency to abandon connections and retry rather than leaving requests open for extended times.
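As a sketch of the last point, a timeout can be derived from measured latencies. The nearest-rank percentile function and the 1.2 margin below are illustrative assumptions, not values taken from the test app.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def suggest_timeout_ms(latencies_ms, margin=1.2):
    """Set the timeout slightly above the 99th percentile latency."""
    return percentile(latencies_ms, 99) * margin

latencies = [10] * 98 + [40, 80]  # mostly fast, two slow requests
timeout_ms = suggest_timeout_ms(latencies)
```

A request slower than `timeout_ms` is abandoned and retried rather than left open for an extended time.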

There can be many causes of tail latency. Typical causes are:

  • Variation in request payload size
  • High CPU utilization
  • Non-optimal application code structure
  • Variation in routing distance of network calls
  • Occasional unexpected server reboots with associated broken connections and cold starts
  • Buffering or queuing
  • Connection management, such as reestablishment of stale connections
  • Network congestion, including packet loss
  • Garbage collection or compaction of data leading to suspension of handling of user requests
  • Variable latency of cloud services
  • Non-optimal settings for connection timeout

Strategies in identifying causes of latency include:

  • Establish methods for accurate latency measurement: Because tail latency by definition is a small proportion of traffic, the accuracy of measurements needs to be understood.
  • Implement monitoring, tracing, and logging solutions: With OpenCensus and Monitoring, the app creates a root span for each request that contains child spans for different measurements. Child spans are then generated automatically by the OpenCensus HTTP integration, and trace-to-log integration is added for recording additional context. This approach minimizes the additional code needed for monitoring and tracing.
  • Explore the data: Look at reports and charts that show the tail of long running requests and identify which child spans take the longest to execute.
  • Identify the reasons for tail latency: Gather information from related sources that can help explain the reasons for long child spans and compare that with annotations in sample spans.
  • Understand more detailed queries: Export logs to BigQuery and combine them with trace information to drill down and quantify app behavior.

The analysis discussed in this document is the result of analyzing high traffic systems scaled out with Cloud Storage and microservices. The primary causes of variability in tail latency were measurement errors, size of payload, variable latency in cloud services, and load on the virtual machines hosting the app.

Cloud users have a choice between using metrics from prebuilt integrations included in libraries, supported metrics from cloud providers, and custom metrics. Examples of metrics from prebuilt integrations are the gRPC, HTTP, and SQL integrations from OpenCensus. Google Cloud includes a large number of supported metrics, which are exposed through Monitoring. Another alternative is to develop code for additional custom metrics, which are often timers and counters. Comparing time series from these three sources is a common task in analyzing sources of latency in microservices and storage of app data.

Overview of the test app

The test app simulates tail latency problems and integrates with OpenCensus and Monitoring to demonstrate the tools and approaches needed to identify the problems. The source code, build files, and instructions are provided in the GitHub project. The example app is written in Java, but you can apply this document to other languages. You can find out more about the OpenCensus instrumentation of HTTP requests in Jetty Integration for Java and Go. The test app is based on this GitHub example code.

The following is a schematic diagram of the app.

Architectural diagram of client app communicating with Cloud Storage and server app on Compute Engine.

The client app communicates with Cloud Storage and a server-side app hosted on a Compute Engine instance.

The following is a typical request flow:

  1. The client app retrieves a JSON file from Cloud Storage by using the Cloud Storage API.
  2. The contents of the file are sent to the server.
  3. The server does some processing of data sent by the client to simulate the time and resources taken by a real app.
  4. An HTTP response is sent back to the client.
  5. The client performs downstream processing. At this point, the test app substitutes simple arithmetic for business logic.

At startup, the app initializes the following:

  • The client configures Java Util Logging (JUL) to export to Cloud Logging with trace IDs included. The server uses a Logback logging appender for trace-log correlation.
  • OpenCensus HTTP metrics views are registered. The OpenCensus HTTP integration for Java includes a number of metrics views for latency, HTTP response code, and bytes sent and received.
  • The OpenCensus Monitoring stats exporter sends aggregated metrics to Monitoring in a background thread.
  • The OpenCensus trace exporter sends trace spans to Cloud Trace in a background thread.
  • The Cloud Storage client API downloads files that are used for HTTP payloads.
  • An app thread pool is created to drive load to the app. Multiple threads drive a higher load to the server than a single thread does.

Monitoring metrics and trace data are collected and processed using the following steps:

  1. OpenCensus libraries collect metrics and trace data.
  2. OpenCensus periodically aggregates metrics data and exports it to Monitoring. Trace data is exported to Trace without aggregation.
  3. OpenCensus adds trace IDs to Cloud Logging entries. Log entries are exported to the Cloud Logging service.

You can export the metrics collected with OpenCensus to a number of backend monitoring systems that provide storage, analysis, visualization, and alerting. Each system has a different data representation format. This document focuses on Monitoring.

Accuracy of aggregated monitoring metrics

OpenCensus and Monitoring provide tools that you can use to monitor app performance. To avoid a false sense of confidence in lines on charts, you need to understand how they were created, especially for lines that represent tail latency, such as the 99th percentile (p99). The following sections examine the limitations of aggregating metrics data.

Representation of metrics time series in Monitoring

Monitoring computes latency percentiles for distribution metrics from bucket counts with fixed numeric boundaries. This bucket-based representation is shared by Monitoring and OpenCensus, and it is the format in which OpenCensus exports metrics data to Monitoring. The TimeSeries.list method of the Cloud Monitoring API returns the bucket counts and boundaries for your project and metric types. You can retrieve the bucket boundaries in the Cloud Monitoring API BucketOptions object, which you can experiment with in the API Explorer for TimeSeries.list.

To filter for the OpenCensus HTTP client latency in the API Explorer, you can use the following filter:

resource.type=global metric.type=""

The result includes an array of data point objects. The following example shows the scheme of buckets and bucket counts:

"points": [ {
  "interval": { "startTime": "2019-02-14T18:11:24.971Z",
                "endTime": "2019-02-14T18:33:35.541Z" },
  "value": {
    "distributionValue": {
    "explicitBuckets": {
      "bounds": [0, 1, 2, 3, 4, 5, 6, 8, 10, 13, 16, 20, 25, 30, 40, 50, 65, 80, 100, 130, ...] }
      "bucketCounts": [0, 0, 0, 0, 0, 14, 17, 164, 3, 0, 1, 0, 0, 0, 0, ...]

Each point that you see in a Monitoring chart for a latency value is derived from a distribution. You can define the buckets by using a growth factor, instead of defining boundaries for each bucket. For example, a growth factor of 2.0 would give bucket boundaries of 0-1, 1-2, 2-4, 4-8,... For more information about Monitoring metrics, see Structure of metric types.
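To make the derivation concrete, the following sketch estimates a percentile from the bounds and bucketCounts shown earlier by linear interpolation inside the bucket where the cumulative count crosses the target rank. The estimator is an assumption that mirrors, but isn't necessarily identical to, the interpolation Monitoring uses.

```python
def percentile_from_buckets(bounds, counts, pct):
    """Estimate a percentile from distribution bucket counts by linearly
    interpolating inside the bucket that crosses the target rank."""
    total = sum(counts)
    target = pct / 100 * total
    cum = 0
    for i, count in enumerate(counts):
        if count > 0 and cum + count >= target:
            lo = bounds[i]
            hi = bounds[i + 1] if i + 1 < len(bounds) else bounds[i]
            return lo + (hi - lo) * (target - cum) / count
        cum += count
    return bounds[-1]

# Truncated bounds and bucketCounts from the TimeSeries.list example.
bounds = [0, 1, 2, 3, 4, 5, 6, 8, 10, 13, 16, 20, 25, 30, 40, 50, 65, 80, 100, 130]
counts = [0, 0, 0, 0, 0, 14, 17, 164, 3, 0, 1, 0, 0, 0, 0]

median = percentile_from_buckets(bounds, counts, 50)
p99 = percentile_from_buckets(bounds, counts, 99)
```

Applied to the sample data, the 99th percentile estimate falls inside the [10, 13) ms bucket, even though the exact raw values behind it are no longer available.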

You can edit the chart to show lines for the 50th (median), 95th, and 99th percentile latency to display the following line chart.

Tail latency chart showing 50th, 95th, and 99th percentile.

To understand the data that is used to calculate these percentile values, add the Heatmap chart type. When the Aggregator is set to sum, the following client latency heatmap is displayed.

Heatmap chart of tail latency.

Heatmaps are a way of displaying the distribution data stored by Monitoring. The areas of the heatmap correspond to the bucket boundaries retrieved by using the REST API. You can hold the pointer over the areas of the heatmap to find out how many points are contained in any given bucket for a time period. The data used to compute the percentile values in the line chart is sparse.

If you want more accurate latency measurements, a few strategies are available:

  • Add more data. You can generate more data in experiments, but traffic to a production app might not be high enough.
  • Make the buckets smaller, and change the boundaries to better suit your data. This is an option if you're creating your own custom metrics. However, you might not be able to make these adjustments if you're using data from prebuilt OpenCensus integrations or data collected by Google Cloud from its backend services. The bucket boundaries for the OpenCensus integrations are determined by the implementers of the integrations in their Views. For example, the HTTP client latency bucket boundaries are defined in the OpenCensus HTTP integration's view definitions. Similarly, the Google Cloud metrics are defined internally in Monitoring.
  • Use traces or logs instead of metrics data for tail percentiles. These options are explored in later sections of this document.

Simulated data

You can compare the accuracy of aggregation models with raw simulated data drawn from a random distribution to see how much difference the bucket boundaries can make. In this example, a beta distribution models the skewed distribution typical of real latency values. This is a distribution of a continuous random variable with no values less than zero, a distinct peak, and a long decreasing tail. The beta distribution uses the random.beta function from NumPy with parameters a = 2, b = 5, and is multiplied by a constant factor of 70 to simulate realistic latency values with a median value of 18.5 ms. The simulation includes 100,000 random values from the beta distribution in the Colab sheet tail_latency_lab.ipynb included in the GitHub project.
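The simulation can be reproduced in a few lines. The document's Colab sheet uses NumPy's random.beta; the sketch below uses the standard library's betavariate, which draws from the same distribution, and an arbitrary seed.

```python
import random
import statistics

# Draw 100,000 latency values from a Beta(a=2, b=5) distribution scaled
# by 70, giving a skewed distribution with a median of about 18.5 ms.
rng = random.Random(42)
latencies_ms = [70 * rng.betavariate(2, 5) for _ in range(100_000)]

median_ms = statistics.median(latencies_ms)
p99_ms = sorted(latencies_ms)[int(0.99 * len(latencies_ms)) - 1]  # nearest rank
```

These raw-sample statistics serve as the ground truth against which the bucket-based estimates in the following charts are compared.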

The following chart shows that buckets are too sparsely spaced with growthFactor = 2.0 to give accurate estimates of tail latencies.

Aggregate model showing growth factor of 2.0.

The linear axis on the chart shows the intended shape of the random distribution. The following chart shows the bucket boundaries with growthFactor = 1.4.

Aggregate model showing growth factor of 1.4.

The following chart shows that the bucket boundaries with growthFactor = 1.2 resemble the intended shape more closely, but it's still somewhat coarse at the high end of the distribution.

Aggregate model showing growth factor of 1.2.

The following table shows estimates based on the bucket-based distribution compared to a computation based on raw values.

Distribution        Median    99th percentile
Raw                 18.5      49.5
Growth factor 2.0   19.2      62.0
Growth factor 1.4   18.7      53.9
Growth factor 1.2   18.6      51.1

In this case, a growth factor of 1.2 is required to give a reasonably accurate estimate of the 99th percentile latency. The OpenCensus exporter converts raw values to a bucketed distribution according to the view configuration. When you create an OpenCensus view, you can configure the bucket boundaries; views contain an Aggregation object. For more information, see the Java documentation or the Go documentation.
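The table's trend can be checked directly. The sketch below builds bucket boundaries from a growth factor, aggregates the simulated samples, and estimates the 99th percentile by linear interpolation; the estimator is an assumption and may differ in detail from Monitoring's internal computation.

```python
import random
from bisect import bisect_right

def growth_buckets(growth_factor, max_value):
    """Exponential bucket boundaries 0, 1, gf, gf^2, ... extended past max_value."""
    bounds = [0.0, 1.0]
    while bounds[-1] <= max_value:
        bounds.append(bounds[-1] * growth_factor)
    return bounds

def bucket_counts(samples, bounds):
    """Count samples per bucket; bucket i covers [bounds[i], bounds[i+1])."""
    counts = [0] * (len(bounds) - 1)
    for v in samples:
        counts[bisect_right(bounds, v) - 1] += 1
    return counts

def estimate_percentile(counts, bounds, pct):
    """Linearly interpolate a percentile from bucketized counts."""
    target = pct / 100 * sum(counts)
    cum = 0
    for i, c in enumerate(counts):
        if c > 0 and cum + c >= target:
            return bounds[i] + (bounds[i + 1] - bounds[i]) * (target - cum) / c
        cum += c
    return bounds[-1]

rng = random.Random(42)
samples = [70 * rng.betavariate(2, 5) for _ in range(100_000)]
raw_p99 = sorted(samples)[98_999]  # nearest-rank 99th percentile

estimates = {}
for gf in (2.0, 1.4, 1.2):
    bounds = growth_buckets(gf, max(samples))
    estimates[gf] = estimate_percentile(bucket_counts(samples, bounds), bounds, 99)
```

With this seed, the estimates reproduce the trend in the table: the growth factor 2.0 estimate overshoots the raw 99th percentile considerably, while growth factor 1.2 comes much closer.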

Measurement of HTTP latency with OpenCensus

When measuring HTTP latency, you measure the time to send and receive an HTTP request from a client to a server and record the result. You can implement this in code either by creating an OpenCensus Tag or by using the prebuilt integration for HTTP clients. The HTTP integration is provided for Go with the Go HTTP package and for Java with the Jetty HTTP client. The advantage of using the OpenCensus HTTP integration is that you don't have to add Tag objects for every metric that you want to collect.

Identifying sources of latency

The previously described test app generated stress on itself that led to increased tail latency. The following problems are demonstrated:

  • Occasionally large payloads
  • High CPU
  • Short timeouts
  • Variations in app code path with retries

The following sections describe how to detect and quantify these problems.

Effect of large payload

To investigate the effect of payload size, the following code randomly increased the size of the payload sent in 1 out of 20 HTTP requests:

static byte[] getContent(String bucket) {
  BlobId blobId = null;
  int n = rand.nextInt(100);
  if (n >= 95) {
    blobId = BlobId.of(bucket, LARGE_FILE);
  } else {
    blobId = BlobId.of(bucket, SMALL_FILE);
  }
  return storage.readAllBytes(blobId);
}

When you view the traces for higher latency requests in the Trace list, they can be correlated to larger payload size.

Trace list showing higher-latency requests.

If you click Show Events, you can view more details about the payload sent and received. For high-latency requests, the event details show a larger payload size.

For the following reasons, Cloud Trace is better than Monitoring at identifying latency:

  • Trace gives latency values for requests instead of aggregated values.
  • Trace gives the payload size for the request.

Effect of high CPU utilization

High CPU utilization can cause increased tail latency due to concurrent processing, which can result in resource contention and subsequent queuing of requests waiting for the CPU. Monitoring provides CPU utilization as one of the built-in metrics for Compute Engine instances. However, when you run performance tests or investigate performance problems, we recommend that you install the Monitoring agent because it provides more operating system and open source metrics, including memory, disk usage, and other metrics that can reveal different types of resource contention. For more information, see the list of Agent metrics.

Managing CPU utilization is easier for server apps than client apps due to autoscaling. On Compute Engine, a managed instance group can use an autoscaling policy based on average CPU utilization, load-balancing serving capacity, Monitoring metrics, and schedules. An autoscaler adds virtual machines to size the fleet appropriately to handle the load. Client CPU utilization can be harder to manage consistently and can affect the measurement of latency because payloads must be read from the network and deserialized, which requires CPU utilization. Because the client app view of latency determines user experience to a greater degree, this document focuses on that type of latency.

Monitoring is more appropriate than tracing for identifying the effect of high CPU utilization because high CPU usage isn't generally the result of a single HTTP request. In the following example, the effect of CPU utilization is investigated by using the test app to run multiple executor threads at once and running the client code on a g1-small virtual machine. The CPU utilization and latency are then compared to values from the client app running on an n1-standard-1 (1 vCPU) virtual machine.

The following charts show the median and 95th percentile end-to-end latency for an experiment. The request rate was increased at 3:21 AM by removing a sleep statement between requests to show the difference in latency with a large increase in CPU utilization. The commands to create the virtual machines and run the tests are available in the GitHub repository.

Median latency by instance ID chart.

The preceding chart shows the median latency for both client and server during a load experiment. Notice the increase at 3:21 when the load was increased. The following chart shows the 95th percentile latency during the same test.

95th latency by instance ID chart.

The following chart shows the CPU utilization for the test. There is also an increase at 3:21.

CPU utilization chart of Compute Engine instance.

Increasing the request rate affects latency, but not in proportion to the CPU utilization. It's somewhat surprising that the latency is consistent for both the 1-vCPU and g1-small instances, even though the g1-small instance was running at around 78% CPU utilization. This result suggests that you can run a fleet at high CPU utilization. A caution here is that this testing isn't intended as a benchmarking study, and your results can vary during app implementation.

Cloud Profiler is a useful tool for investigating instances of high CPU consumption.

Latency due to app code branches and parameter values

Logging can help you solve the most difficult app latency problems. This section demonstrates how to combine log and trace data to investigate latency problems that are influenced by varying code paths and parameter values. A trace-to-log approach avoids adding many timing statements to your code. A common problem in microservice-based apps is the management of timeouts and retry attempts. Long timeouts can lead to a poor user experience and can topple an app if they tie up too many connections. We recommend setting a short timeout, waiting for a while, and then retrying. However, aggressive retries can lead to retry storms and can make incident recovery difficult. You should set the timeout so that only a small percentage of requests, for example, less than 1%, require retries. However, the test app sets the retry timeout to a low value for HTTP requests to the test server to demonstrate retries in action without having to wait for a real outage to occur.

Retries are also implemented in the Google API Client Library for Cloud Storage so you don't need to implement that in your own code. For more information about exponential backoff, see Truncated exponential backoff.
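A truncated exponential backoff schedule can be sketched as follows. The initial delay, multiplier, and cap are illustrative values, not the test app's settings.

```python
import random

def backoff_delays_ms(initial=250.0, multiplier=2.0, cap=8000.0,
                      max_retries=6, rng=random.Random(0)):
    """Compute per-attempt wait times for truncated exponential backoff.

    The nominal delay doubles after each attempt and is truncated at
    `cap`. Full jitter (a uniform draw up to the nominal delay) spreads
    retries out to avoid synchronized retry storms.
    """
    delays = []
    delay = initial
    for _ in range(max_retries):
        delays.append(rng.uniform(0, delay))  # full jitter
        delay = min(delay * multiplier, cap)
    return delays
```

Each attempt waits at most its truncated nominal delay, so a client never backs off longer than the cap regardless of how many retries occur.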

With the OpenCensus Log Correlation integration, you can view the log entries associated with traces. This integration uses a Cloud Logging Enhancer to add trace information to logs. You can add it to your app by setting up Logging for Java and adding the opencensus-contrib-log-correlation-stackdriver Maven dependency to your build. At the time of writing this document, log correlation for Go isn't implemented, but you can request the feature. The current alternative for Go is to add annotations to trace spans with the required information.

The method in the test app for preparing, sending, and processing requests to the microservice is as follows:

private void prepareSendProcess(
    HttpClient httpClient,
    HttpMethod method,
    Function<Integer[], Integer> downStreamFn,
    String fnName)
    throws InterruptedException {
  Tracer tracer = Tracing.getTracer();
  try (Scope scope = tracer.spanBuilder("main").startScopedSpan()) {
    StopWatch s = StopWatch.createStarted();
    byte[] content = new byte[0];
    if (method == HttpMethod.POST) {
      content = TestInstrumentation.getContent(testOptions.bucket());
    }
    byte[] payload = sendWithRetry(httpClient, method, content);
    TestInstrumentation.processPayload(payload, downStreamFn, fnName);
    // Record the elapsed time in ms; recordTaggedStat stands in for the
    // app's metric-recording helper.
    TestInstrumentation.recordTaggedStat(
        method.toString(), s.getTime(TimeUnit.NANOSECONDS) / 1.0e6);
  }
}

The method creates a scoped span as a top-level trace for the request that covers the code for all phases of preparation, sending the HTTP request to the server, and downstream processing. The prepareSendProcess method calls sendWithRetry, which wraps the HTTP request with a retry mechanism:

private byte[] sendWithRetry(HttpClient httpClient, HttpMethod method, byte[] content)
    throws InterruptedException {
  ExponentialBackOff backoff = new ExponentialBackOff.Builder().build();
  for (int i = 0; i < MAX_RETRIES; i++) {
    try {
      return sendRequest(httpClient, method, content);
    } catch (RetryableException e) {
      LOGGER.log(Level.WARNING, "RetryableException attempt: " + (i + 1) + " " + e.getMessage());
    } catch (InterruptedException e) {
      LOGGER.log(
          Level.WARNING, "InterruptedException attempt: " + (i + 1) + " " + e.getMessage());
    } catch (TimeoutException e) {
      LOGGER.log(Level.WARNING, "TimeoutException attempt: " + (i + 1) + " " + e.getMessage());
    } catch (ExecutionException e) {
      LOGGER.log(Level.WARNING, "ExecutionException attempt: " + (i + 1) + " " + e.getMessage());
    }
    try {
      // Wait for the next backoff interval before retrying.
      Thread.sleep(backoff.nextBackOffMillis());
    } catch (IOException e) {
      throw new RuntimeException("MaxElapsedTime exceeded");
    }
  }
  throw new RuntimeException("Max retries exceeded");
}

The following method sends the HTTP request:

private byte[] sendRequest(HttpClient httpClient, HttpMethod method, byte[] content)
    throws InterruptedException, TimeoutException, ExecutionException, RetryableException {
  String targetURL = testOptions.targetURL();
  HttpRequest request = (HttpRequest) httpClient.newRequest(targetURL).method(method);
  if (request == null) {
    throw new RetryableException("Request is null");
  }
  if (method == HttpMethod.POST) {
    ContentProvider contentProvider =
        new StringContentProvider(new String(content, StandardCharsets.UTF_8));
    request.content(contentProvider, "application/json");
  }
  request.timeout(testOptions.httpTimeout(), TimeUnit.MILLISECONDS);
  ContentResponse response = request.send();
  int status = response.getStatus();
  LOGGER.info("Response status: " + status + ", " + method);
  if (HttpStatus.isSuccess(status)) {
    byte[] payload = response.getContent();
    LOGGER.info("Response payload: " + payload.length + " bytes");
    return payload;
  } else if (HttpStatus.isServerError(status)) {
    throw new RetryableException(response.getReason());
  }
  return new byte[0];
}

This method creates an HTTPRequest object and then adds the payload if a POST request is sent. The following screenshot shows an example trace.

Example trace of an `HTTPRequest` object.

In Monitoring, when you click Show Logs, the log entries are displayed in the trace detail. When you click a Trace Detail icon, as described in Finding and viewing traces, trace information appears. You can match the log text to the source code to understand which code path generated the trace, and to discover details of other parameters, such as the HTTP method and response. The previous example shows that a timeout occurred in this trace. The span is long because it's penalized by the backoff, and it includes six retry attempts, the maximum configured number.

To view trace events, click Show Events.

Trace events with high latency.

The high latency was caused by a large payload. The timeout of 8 ms was too low to handle a payload of this size and should be reset to a higher value.

You can add annotations to trace spans with similar information to log statements. However, adding key information about the different code paths in your app to the trace spans requires that you add code for trace spans in your app's code. If your development team isn't familiar with tracing, it might be easier for them to add this in log statements with JUL or Logback and set up Logging for Java. Plus, you might already have log statements in your code.

To identify the slowest requests, you can generate a trace analysis report, which includes sample traces in different request percentiles, as shown in the following screenshot.

Trace analysis report with sample traces in different request percentiles.

To see detailed traces for the higher percentiles, you can click the sample traces.

Combining log statements and trace IDs in BigQuery

To search for more log information, you can query the logs directly in the Logs Explorer with a log filter.

Log query with log filter.

A useful feature of this filter is that it connects log entries with the same trace ID, so you can correlate log entries from the same request. You can then export the log entries to BigQuery with a filter such as resource.type="gce_instance".

Log filter for Compute Engine instance.

If you export the logs to a dataset called application_test_dataset, you can explore the logs with the following query:

SELECT
  TIME(timestamp) AS Time,
  SUBSTR(textPayload, 90) AS Message,
  SUBSTR(trace, 31) AS trace_id
FROM
  application_test_dataset.your_log_table  -- Placeholder: the table created by your log export
WHERE REGEXP_CONTAINS(textPayload, "attempt")

Time             Message                             trace_id
10:59:11.782000  WARNING: TimeoutException attempt:  008a0ce...
One row of sample data is shown. The result of this query gives similar information to the log viewer. However, now that the log data is in BigQuery you can do more powerful queries. The following query shows how many requests had a different number of retries:

SELECT
  RetryCount,
  COUNT(trace) AS RequestCount
FROM (
  SELECT
    trace,
    COUNTIF(REGEXP_CONTAINS(textPayload, "attempt")) AS RetryCount
  FROM
    application_test_dataset.your_log_table  -- Placeholder: the table created by your log export
  WHERE NOT REGEXP_CONTAINS(trace, '00000000000000000000000000000000')
  GROUP BY trace)
GROUP BY RetryCount
ORDER BY RetryCount DESC

RetryCount RequestCount
8          13
7          180
6          2332
5          242
4          507
3          416605
1          1

From the query results, you can see that 13 requests reached the maximum of 8 attempts. You can use the results of this query to refine your retry parameters. For example, you can increase the timeout, vary the timeout with payload size, or increase the maximum number of retries.
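For example, a quick calculation over the query result shows how rare high retry counts are, which helps you decide whether the current timeout is acceptable. This assumes each logged "attempt" line corresponds to one attempt for its request.

```python
# Requests per retry count, taken from the BigQuery result above.
requests_by_retry_count = {8: 13, 7: 180, 6: 2332, 5: 242, 4: 507, 3: 416605, 1: 1}

total_requests = sum(requests_by_retry_count.values())
# Share of requests that needed more than six attempts.
share_over_6 = sum(
    n for attempts, n in requests_by_retry_count.items() if attempts > 6
) / total_requests
```

Well under 0.1% of requests exceed six attempts in this data, so the timeout is close to the less-than-1% retry guideline discussed earlier.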

What's next