This document helps app developers identify the causes of tail latency by using OpenCensus and Cloud Monitoring for monitoring metrics and distributed tracing.
When you observe how your app serves users, you find that some users experience higher app latency than others. These differences grow as your user population becomes more diverse and as resource usage increases. Identifying and addressing the sources of high latency lets you serve a larger population of users while keeping resource usage at optimal levels. As resource usage increases, tail latency is the first symptom of resource contention to become apparent.
Although the group of users experiencing these higher latencies might be small, it can still be important, for example, if it represents the leading edge of a new potential market.
This document discusses how to identify and explain the sources of high latency that these users experience. In particular, the document answers the following questions:
- How can you measure tail latency accurately?
- What are the main causes of app latency?
- How can you evaluate strategies for minimizing tail latency?
The main themes that answer these questions are:
- Identifying the causes of inaccuracy in the estimates of monitoring metrics.
- Identifying the causes of high latency.
- Minimizing tail latency.
Source code for the test app is provided in a GitHub repository.
This document assumes that you are familiar with basic use of OpenCensus, Cloud Monitoring, Cloud Trace, BigQuery, and app development in Java.
Terminology
- metric
- A measure of app performance.
- latency
- The time it takes the app to perform a task. The latency values in this document are measured from within the app's code.
- tail latency
- The higher latency values experienced by a small set of users. Tail latency can be defined at percentile levels. For example, the 99th percentile is the highest 1% of latency values for requests measured over a time interval. For more information, see the "Worrying about your Tail" section in the "Monitoring Distributed Systems" chapter of Site Reliability Engineering.
- aggregation model
- A representation of monitoring metrics that shows the statistical distribution of the measurements collected. This document assumes a histogram model with the range of latencies divided into buckets and the counts for each bucket recorded for each time interval. Monitoring uses the term distribution metric for metrics with this kind of representation.
Background
Cloud app owners often monitor a combination of median and tail latency values. The median latency is a good indicator for the overall health of your app. However, as your app scales up, subtle problems can appear that aren't readily observable through the median latency.
Understanding tail latency can be useful for a number of purposes:
- Detecting emerging problems early that aren't apparent in the median latency.
- Making sure that high utilization isn't achieved at the cost of suboptimal handling of some requests.
- Monitoring a fleet when a single server is unhealthy or when a code path that only a minority of users follow isn't handled well.
- Informing strategies for minimizing tail latency, such as setting appropriate values for timeouts and backoff retries. For example, you might set the timeout slightly higher than the 99th percentile latency so that slow connections are abandoned and retried rather than left open for extended times.
There can be many causes of tail latency. Typical causes are:
- Variation in request payload size
- High CPU utilization
- Non-optimal application code structure
- Variation in routing distance of network calls
- Occasional unexpected server reboots with associated broken connections and cold starts
- Buffering or queuing
- Connection management, such as reestablishment of stale connections
- Network congestion, including packet loss
- Garbage collection or compaction of data leading to suspension of handling of user requests
- Variable latency of cloud services
- Non-optimal settings for connection timeout
Strategies for identifying the causes of latency include:
- Establish methods for accurate latency measurement: Because tail latency by definition is a small proportion of traffic, the accuracy of measurements needs to be understood.
- Implement monitoring, tracing, and logging solutions: By using OpenCensus and Monitoring, the app creates a root trace span for each request that contains child spans for the different measurements. Child spans are generated automatically by the OpenCensus HTTP integration, and trace-to-log integration is added to record additional context. This approach minimizes the extra code needed for monitoring and tracing.
- Explore the data: Look at reports and charts that show the tail of long running requests and identify which child spans take the longest to execute.
- Identify the reasons for tail latency: Gather information from related sources that can help explain the reasons for long child spans and compare that with annotations in sample spans.
- Understand more detailed queries: Export logs to BigQuery and combine them with trace information to drill down and quantify app behavior.
The analysis discussed in this document is the result of analyzing high traffic systems scaled out with Cloud Storage and microservices. The primary causes of variability in tail latency were measurement errors, size of payload, variable latency in cloud services, and load on the virtual machines hosting the app.
Cloud users can choose between metrics from prebuilt integrations included in libraries, supported metrics from cloud providers, and custom metrics. Examples of metrics from prebuilt integrations are the gRPC, HTTP, and SQL integrations from OpenCensus. Google Cloud includes a large number of supported metrics, which are exposed through Monitoring. Another alternative is to develop code for additional custom metrics, which are often timers and counters. Comparing time series from these three sources is a common task in analyzing the sources of latency in microservices and in the storage of app data.
Overview of the test app
The test app simulates tail latency problems and integrates with OpenCensus and Monitoring to demonstrate the tools and approaches needed to identify the problems. The source code, build files, and instructions are provided in the GitHub project. The example app is written in Java, but you can apply this document to other languages. You can find out more about the OpenCensus instrumentation of HTTP requests in Jetty Integration for Java and Go. The test app is based on this GitHub example code.
The following is a schematic diagram of the app.
The client app communicates with Cloud Storage and a server-side app hosted on a Compute Engine instance.
The following is a typical request flow:
- The client app uses the Cloud Storage API to retrieve a JSON file from Cloud Storage.
- The contents of the file are sent to the server.
- The server does some processing of data sent by the client to simulate the time and resources taken by a real app.
- An HTTP response is sent back to the client.
- The client performs downstream processing. At this point, the test app substitutes simple arithmetic for business logic.
At startup, the app initializes the following:
- The client configures Java Util Logging (JUL) to export logs to Cloud Logging with trace IDs included. The server uses a Logback logging appender for trace-log correlation.
- OpenCensus HTTP metrics views are registered. The OpenCensus HTTP integration for Java includes a number of metrics views for latency, HTTP response code, and bytes sent and received.
- The OpenCensus Monitoring stats exporter sends aggregated metrics to Monitoring in a background thread.
- The OpenCensus Monitoring trace exporter sends trace spans to Trace in a background thread. (A sketch of this initialization follows this list.)
- The Cloud Storage client API downloads files that are used for HTTP payloads.
- An app thread pool is created to drive load to the app. Multiple threads drive a higher load to the server than a single thread does.
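The following is a minimal sketch of this kind of startup initialization, assuming the opencensus-contrib-http-util, opencensus-exporter-stats-stackdriver, and opencensus-exporter-trace-stackdriver artifacts are on the classpath. It isn't the test app's exact code; see the GitHub repository for that.

import io.opencensus.contrib.http.util.HttpViews;
import io.opencensus.exporter.stats.stackdriver.StackdriverStatsExporter;
import io.opencensus.exporter.trace.stackdriver.StackdriverTraceConfiguration;
import io.opencensus.exporter.trace.stackdriver.StackdriverTraceExporter;
import java.io.IOException;

public class TelemetryBootstrap {
  // Registers the prebuilt HTTP views and the exporters that push aggregated
  // metrics and trace spans to Google Cloud in background threads.
  public static void init() throws IOException {
    // Prebuilt client views for latency, response code, and bytes sent/received.
    HttpViews.registerAllClientViews();

    // Export aggregated metrics to Cloud Monitoring.
    StackdriverStatsExporter.createAndRegister();

    // Export trace spans to Cloud Trace.
    StackdriverTraceExporter.createAndRegister(
        StackdriverTraceConfiguration.builder().build());
  }
}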
Monitoring metrics and trace data are collected and processed using the following steps:
- OpenCensus libraries collect metrics and trace data.
- OpenCensus periodically aggregates metrics data and exports it to Monitoring. Trace data is exported to Trace without aggregation.
- OpenCensus adds trace IDs to Cloud Logging entries. Log entries are exported to the Cloud Logging service.
You can export the metrics collected with OpenCensus to a number of backend monitoring systems that provide storage, analysis, visualization, and alerting. Each system has a different data representation format. This document focuses on Monitoring.
Accuracy of aggregated monitoring metrics
OpenCensus and Monitoring provide tools that you can use to monitor app performance. To avoid a false sense of confidence in lines on charts, you need to understand how they were created, especially for lines that represent tail latency, such as the 99th percentile (p99). The following sections examine the limitations of aggregating metrics data.
Representation of metrics time series in Monitoring
Monitoring computes latency percentiles for distribution metrics based on bucket boundaries at numeric intervals. This is the common method used by Monitoring and OpenCensus, where OpenCensus represents metrics data and exports it to Monitoring. The TimeSeries.list method of the Cloud Monitoring API returns the bucket counts and boundaries for your project and metric types. You can retrieve the bucket boundaries in the Cloud Monitoring API BucketOptions object, which you can experiment with in the API Explorer for TimeSeries.list.
To filter for the OpenCensus HTTP client latency in the API Explorer, you can use the following filter:
resource.type=global metric.type="custom.googleapis.com/opencensus/opencensus.io/http/client/roundtrip_latency"
The result includes an array of data point objects. The following example shows the scheme of buckets and bucket counts:
"points": [ { "interval": { "startTime": "2019-02-14T18:11:24.971Z", "endTime": "2019-02-14T18:33:35.541Z" }, "value": { "distributionValue": { ... "explicitBuckets": { "bounds": [0, 1, 2, 3, 4, 5, 6, 8, 10, 13, 16, 20, 25, 30, 40, 50, 65, 80, 100, 130, ...] } }, "bucketCounts": [0, 0, 0, 0, 0, 14, 17, 164, 3, 0, 1, 0, 0, 0, 0, ...] }}]
Each point that you see in a Monitoring chart for a latency value is derived from a distribution. You can define the buckets by using a growth factor, instead of defining boundaries for each bucket. For example, a growth factor of 2.0 would give bucket boundaries of 0-1, 1-2, 2-4, 4-8,... For more information about Monitoring metrics, see Structure of metric types.
You can edit the chart to show lines for the 50th (median), 95th, and 99th percentile latency to display the following line chart.
To understand the data that is used to calculate these percentile values, add the Heatmap chart type. When the Aggregator is set to sum, the following client latency heatmap is displayed.
Heatmaps are a way of displaying the distribution data stored by Monitoring. The areas of the heatmap correspond to the bucket boundaries retrieved by using the REST API. You can hold the pointer over an area of the heatmap to find out how many points are contained in a given bucket for a time period. The data used to compute the percentile values in the line chart is sparse.
If you want more accurate latency measurements, a few strategies are available:
- Add more data. You can generate more data in experiments, but traffic to a production app might not be high enough.
- Make the buckets smaller, and change the boundaries to better suit your data. This is an option if you're creating your own custom metrics. However, you might not be able to make these adjustments if you're using data from prebuilt OpenCensus integrations or collected by Google Cloud from its backend services. The bucket boundaries for the OpenCensus integrations are determined by implementers of the OpenCensus integrations in Views. For example, the HTTP client latency bucket boundaries are defined in the class HttpViewConstants.java. Similarly, the Google Cloud metrics are defined internally in Monitoring.
- Use traces or logs instead of metrics data for tail percentiles. These options are explored in later sections of this document.
Simulated data
You can compare the accuracy of aggregation models against raw simulated data drawn from a random distribution to see how much difference the bucket boundaries can make. In this example, a beta distribution models the skewed distribution typical of real latency values. This is a distribution of a continuous random variable with no values less than zero, a distinct peak, and a long decreasing tail. The simulation uses the random.beta function from NumPy with parameters a = 2, b = 5, multiplied by a constant factor of 70 to produce realistic latency values with a median of 18.5 ms. The simulation includes 100,000 random values from the beta distribution and is provided in the Colab sheet tail_latency_lab.ipynb in the GitHub project.
The following chart shows that buckets are too sparsely spaced with growthFactor = 2.0 to give accurate estimates of tail latencies.
The linear axis on the chart shows the intended shape of the random distribution.
The following chart shows the bucket boundaries with growthFactor = 1.4.
The following chart shows that the bucket boundaries with growthFactor = 1.2 resemble the intended shape more closely, but the distribution is still somewhat coarse at the high end.
The following table shows estimates based on the bucket-based distribution compared to a computation based on raw values.
Distribution | Median (ms) | 99th percentile (ms)
---|---|---
Raw | 18.5 | 49.5
Growth factor 2.0 | 19.2 | 62.0
Growth factor 1.4 | 18.7 | 53.9
Growth factor 1.2 | 18.6 | 51.1
In this case, a growth factor of 1.2 is required to give a reasonably accurate estimate of the 99th percentile latency. From within the OpenCensus exporter, you can convert raw values into a sampled distribution. When you create an OpenCensus view, you can configure the bucket boundaries. Views contain an Aggregation object. For more information, see the Java documentation or the Go documentation.
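For example, the following is a minimal sketch of how you might register a custom latency view with finer bucket boundaries (roughly a 1.2 growth factor) by using the OpenCensus stats API for Java. The measure name and boundary range are illustrative and not part of the test app.

import io.opencensus.stats.Aggregation;
import io.opencensus.stats.BucketBoundaries;
import io.opencensus.stats.Measure.MeasureDouble;
import io.opencensus.stats.Stats;
import io.opencensus.stats.View;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CustomLatencyView {
  static final MeasureDouble LATENCY_MS =
      MeasureDouble.create("myapp/request_latency", "End-to-end request latency", "ms");

  public static void registerView() {
    // Geometric bucket boundaries with a growth factor of about 1.2, which
    // gave a reasonably accurate 99th percentile estimate in the simulation.
    List<Double> bounds = new ArrayList<>();
    for (double bound = 1.0; bound < 1000.0; bound *= 1.2) {
      bounds.add(bound);
    }
    View view = View.create(
        View.Name.create("myapp/request_latency_distribution"),
        "Distribution of end-to-end request latency",
        LATENCY_MS,
        Aggregation.Distribution.create(BucketBoundaries.create(bounds)),
        Collections.emptyList()); // No tag keys in this sketch.
    Stats.getViewManager().registerView(view);
  }
}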
Measurement of HTTP latency with OpenCensus
When measuring HTTP latency, you measure the time to send and receive an HTTP request from a client to a server and record the result. You can implement this in code either by creating an OpenCensus Tag or by using the prebuilt integration for HTTP clients. The HTTP integration is provided for Go with the Go HTTP package and for Java with the Jetty HTTP client. The advantage of using the OpenCensus HTTP integration is that you don't have to add Tag objects for every metric that you want to collect.
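The following sketch shows how a client might use the Jetty integration, assuming the opencensus-contrib-http-jetty-client artifact and its default OcJettyHttpClient constructor. The URL is a placeholder.

import io.opencensus.contrib.http.jetty.client.OcJettyHttpClient;
import io.opencensus.contrib.http.util.HttpViews;
import org.eclipse.jetty.client.api.ContentResponse;

public class InstrumentedClient {
  public static void main(String[] args) throws Exception {
    // Register the prebuilt client views so that latency, response codes, and
    // bytes sent and received are recorded without explicit Tag objects.
    HttpViews.registerAllClientViews();

    // OcJettyHttpClient extends Jetty's HttpClient and records OpenCensus
    // stats and trace spans for each request it sends.
    OcJettyHttpClient httpClient = new OcJettyHttpClient();
    httpClient.start();
    ContentResponse response = httpClient.GET("http://example.com/data");
    System.out.println("Status: " + response.getStatus());
    httpClient.stop();
  }
}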
Identifying sources of latency
The previously described test app generated stress on itself that led to increased tail latency. The following problems are demonstrated:
- Occasionally large payloads
- High CPU
- Short timeouts
- Variations in app code path with retries
The following sections describe how to detect and quantify these problems.
Effect of large payload
To investigate the effect of payload size, the test app randomly increases the size of the payload sent in 1 out of 20 HTTP requests.
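A minimal sketch of this kind of payload selection might look like the following; the small and large payloads are hypothetical byte arrays read from Cloud Storage.

import java.util.Random;

public class PayloadChooser {
  private static final Random RANDOM = new Random();

  // Returns the large payload for roughly 1 out of 20 requests to simulate
  // occasional large requests; otherwise returns the small payload.
  public static byte[] choosePayload(byte[] smallPayload, byte[] largePayload) {
    return RANDOM.nextInt(20) == 0 ? largePayload : smallPayload;
  }
}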
When you view the traces for higher-latency requests in the Trace list, they can be correlated with larger payload sizes.
If you click Show Events, you can view more details about the payload sent and received. For high-latency requests, the events show a larger payload size.
For the following reasons, Cloud Trace is better suited than Monitoring for identifying the latency of individual requests:
- Trace gives latency values for requests instead of aggregated values.
- Trace gives the payload size for the request.
Effect of high CPU utilization
High CPU utilization can cause increased tail latency because concurrent processing can result in resource contention and the subsequent queuing of requests waiting for the CPU. Monitoring provides CPU utilization as one of the built-in metrics for Compute Engine instances. However, when you run performance tests or investigate performance problems, we recommend that you install the Monitoring agent because it provides more operating system and open source metrics, including memory, disk usage, and other metrics that can reveal different types of resource contention. For more information, see the list of Agent metrics.
Managing CPU utilization is easier for server apps than client apps due to autoscaling. On Compute Engine, managed instance groups can use autoscaling policies based on average CPU utilization, load balancing serving capacity, or Monitoring metrics. Those policies add virtual machines to size the fleet appropriately to handle the load. Client CPU utilization can be harder to manage consistently and can affect the measurement of latency because payloads must be read from the network and deserialized, which requires CPU utilization. Because the client app view of latency determines user experience to a greater degree, this document focuses on that type of latency.
Monitoring is more appropriate than tracing for identifying the effect of high CPU utilization because high CPU utilization isn't generally the result of a single HTTP request. In the following example, the effect of CPU utilization is investigated by using the test app to run multiple executor threads at once and running the client code on a g1-small virtual machine. The CPU utilization and latency are then compared to values from the client app running on an n1-standard-1 virtual machine, which has 1 vCPU.
The following charts show the median and 95th percentile end-to-end latency for an experiment. The request rate was increased at 3:21 AM by removing a sleep statement between requests to show the difference in latency with a large increase in CPU utilization. The commands to create the virtual machines and run the tests are available in the GitHub repository.
The preceding chart shows the median latency for both client and server during a load experiment. Notice the increase at 3:21 when the load was increased. The following chart shows the 95th percentile latency during the same test.
The following chart shows the CPU utilization for the test. There is also an increase at 3:21.
Increasing the request rate affects latency, but not in proportion to the CPU utilization. It's somewhat surprising that the latency is consistent for both the 1 vCPU and the small instances even though the small instance was running at around 78% CPU utilization. This result suggests that you can run a fleet with high CPU utilization. A caution here is that this testing isn't intended as a benchmarking study, and your results can vary when you implement your own app.
Cloud Profiler is a useful tool for investigating instances of high CPU consumption.
Latency due to app code branches and parameter values
Logging can help you solve the most difficult app latency problems. This section demonstrates how to combine log and trace data to investigate latency problems that are influenced by varying code paths and parameter values. A trace-to-log approach avoids adding a lot of timing statements to your code. A common problem in microservice-based apps is the management of timeouts and retry attempts. Long timeouts can lead to a poor user experience and can topple an app if they tie up too many connections. We recommend setting a short timeout, waiting for a while, and then retrying. However, aggressive retries can lead to retry storms and make incident recovery difficult. You should set the timeout so that only a small percentage of requests, for example, less than 1%, require retries. However, the test app sets the timeout for HTTP requests to the test server to a low value to demonstrate retries in action without having to wait for a real outage to occur.
Retries are also implemented in the Google API Client Library for Cloud Storage so you don't need to implement that in your own code. For more information about exponential backoff, see Truncated exponential backoff.
With the OpenCensus Monitoring Log Correlation integration, you can view the log entries associated with traces. This integration uses a Cloud Logging Enhancer to add trace information to logs. You can add it to your app by setting up Logging for Java and adding the opencensus-contrib-log-correlation-stackdriver Maven dependency to your build. At the time of writing this document, log integration with Go isn't implemented, but you can request the feature. The current alternative for Go is to add annotations to trace spans with the required information.
The test app has a method that prepares, sends, and processes requests to the microservice.
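The full listing is in the GitHub repository; a simplified sketch, using the OpenCensus tracing API, might look like the following. The class and helper names are illustrative.

import io.opencensus.common.Scope;
import io.opencensus.trace.Tracer;
import io.opencensus.trace.Tracing;
import java.util.logging.Logger;

public class ClientSketch {
  private static final Logger LOGGER = Logger.getLogger(ClientSketch.class.getName());
  private static final Tracer TRACER = Tracing.getTracer();

  // Top-level scoped span covering preparation, the HTTP request to the
  // server, and downstream processing of the response.
  void prepareSendProcess(byte[] payload) throws Exception {
    try (Scope scope = TRACER.spanBuilder("prepareSendProcess").startScopedSpan()) {
      String response = sendWithRetry(payload);
      process(response);
    }
  }

  // Stand-in for downstream business logic (the test app uses simple arithmetic).
  private void process(String response) {}

  // sendWithRetry and sendRequest are sketched in the next listings.
}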
The method creates a scoped span as a top-level trace for the request that covers the code for all phases of preparation, sending the HTTP request to the server, and downstream processing. The prepareSendProcess method calls sendWithRetry, which wraps the HTTP request with a retry mechanism:
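Continuing the sketch, a retry wrapper along these lines uses truncated exponential backoff; the backoff values and the maximum attempt count are illustrative.

  // Retries the request with truncated exponential backoff when it times out.
  String sendWithRetry(byte[] payload) throws Exception {
    long backoffMillis = 10;
    final long maxBackoffMillis = 1000;
    final int maxAttempts = 6;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return sendRequest(payload);
      } catch (java.util.concurrent.TimeoutException e) {
        // This log entry is correlated with the trace by the log-correlation integration.
        LOGGER.warning("TimeoutException attempt: " + attempt);
        Thread.sleep(backoffMillis);
        backoffMillis = Math.min(backoffMillis * 2, maxBackoffMillis);
      }
    }
    throw new java.util.concurrent.TimeoutException("Retries exhausted");
  }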
The following method sends the HTTP request:
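A sketch of this step, using the instrumented Jetty client from earlier, might look like the following; httpClient is assumed to be an OcJettyHttpClient field, and the endpoint and timeout values are illustrative.

  // Builds and sends the HTTP request. A payload is attached only for POST requests.
  String sendRequest(byte[] payload) throws Exception {
    org.eclipse.jetty.client.api.Request request = httpClient
        .newRequest("http://server-address/process")             // Illustrative endpoint.
        .timeout(8, java.util.concurrent.TimeUnit.MILLISECONDS); // Deliberately short timeout.
    if (payload != null) {
      request.method(org.eclipse.jetty.http.HttpMethod.POST)
          .content(new org.eclipse.jetty.client.util.BytesContentProvider(payload));
    }
    return request.send().getContentAsString();
  }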
This method creates an HTTPRequest object and then adds the payload if a POST request is sent. The following shows an example trace.
In Monitoring, when you click Show Logs, the logs are displayed in the trace detail. When you click a Trace Detail icon, as described in Finding and viewing traces, trace information appears. You can match the log text to the source code to understand which code path generated the trace and to discover details of other parameters, such as the HTTP method and response. The previous example shows that a timeout occurred in this trace. The overall span is long because it's penalized by the backoff. There are also six retry attempts, which is the maximum configured number.
To view trace events, click Show Events.
The high latency was caused by a large payload. The timeout of 8 ms was too low to handle a payload this large, so the timeout needs to be reset to a higher value.
You can add annotations to trace spans with similar information to log statements. However, adding key information about the different code paths in your app to the trace spans requires that you add code for trace spans in your app's code. If your development team isn't familiar with tracing, it might be easier for them to add this in log statements with JUL or Logback and set up Logging for Java. Plus, you might already have log statements in your code.
To identify the slowest requests, you can generate a trace analysis report, which includes sample traces in different request percentiles, as shown in the following screenshot.
To see detailed traces for the higher percentiles, click the sample traces.
Combining log statements and trace IDs in BigQuery
To search for more log information, you can query the logs directly in the Cloud Logging Log Viewer with a log filter.
A useful feature of this filter is that it connects the three log entries that have the same trace ID, so you can relate log entries from the same request. You can then export the log entries to BigQuery with a filter like resource.type="gce_instance".
If you export the logs to a dataset called application_test_dataset, you can explore the logs with the following query:
SELECT
TIME(timestamp) AS Time,
SUBSTR(textPayload, 90) as Message,
SUBSTR(trace, 31) as trace_id
FROM
`application_test_dataset.java_log_20190328`
WHERE REGEXP_CONTAINS(textPayload, "attempt")
LIMIT 10;
...
Time Message trace_id
10:59:11.782000 WARNING: TimeoutException attempt: 008a0ce...
One row of sample data is shown. The result of this query gives similar information to the log viewer. However, now that the log data is in BigQuery you can do more powerful queries. The following query shows how many requests had a different number of retries:
SELECT
RetryCount,
COUNT(trace) AS RequestCount
FROM(
SELECT
COUNT(REGEXP_CONTAINS(textPayload, "attempt")) as RetryCount,
trace
FROM
`application_test_dataset.java_log_20190328`
WHERE NOT REGEXP_CONTAINS(trace, '00000000000000000000000000000000')
GROUP BY trace
)
GROUP BY RetryCount
ORDER BY RetryCount DESC
LIMIT 10;
...
RetryCount RequestCount
8 13
7 180
6 2332
5 242
4 507
3 416605
1 1
From the query results, you can see that there were 13 requests that reached the maximum retry count of 8. You can use the results of this query to refine your retry parameters. For example, you can increase the timeout, vary the timeout with payload size, or increase the maximum number of retries.
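For example, one simple way to vary the timeout with payload size is a linear rule like the following sketch. The base timeout, per-kilobyte allowance, and cap are hypothetical values that you would derive from your own latency data.

public class TimeoutPolicy {
  // Hypothetical rule: a base timeout plus an allowance that grows with
  // payload size, capped at a maximum value.
  static long timeoutMillisFor(int payloadBytes) {
    final long baseTimeoutMillis = 20;    // Illustrative base, near the small-payload p99.
    final double millisPerKilobyte = 0.5; // Illustrative allowance for larger payloads.
    final long maxTimeoutMillis = 500;
    long timeout = baseTimeoutMillis + (long) (millisPerKilobyte * payloadBytes / 1024);
    return Math.min(timeout, maxTimeoutMillis);
  }
}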
What's next
Review the following:
- Other integrations and APIs for OpenCensus.
- Other supported metrics and APIs in the Monitoring documentation.
Read more about the Java HTTP Integration for OpenCensus.
Read the open source code and examples at the OpenCensus GitHub project.
Query logs in BigQuery in Using exported logs.
Try out other Google Cloud features for yourself. Have a look at our tutorials.