Step-by-step guide to resolve DEADLINE_EXCEEDED errors on Cloud Spanner

When accessing Cloud Spanner APIs, requests may fail due to “Deadline Exceeded” errors. This error indicates that a response has not been obtained within the configured timeout.

A Deadline Exceeded error may occur for several different reasons, such as overloaded Cloud Spanner instances, unoptimized schemas, or unoptimized queries. This post describes some of the common scenarios where a Deadline Exceeded error can happen and provides tips on how to investigate and resolve these issues.

Cloud Spanner’s deadline and retry philosophy

Cloud Spanner’s deadline and retry philosophy differs from many other systems. In Cloud Spanner, users should specify the deadline as the maximum amount of time in which a response is useful. Setting an artificially short deadline just to immediately retry the same operation is not recommended, as this leads to situations where operations never complete. In this context, the following strategies are counterproductive and defeat Cloud Spanner’s internal retry behavior:

  • Setting a deadline of 1 second for an operation that takes 2 seconds to complete is not useful, as no number of retries will return a successful result.

  • Canceling and retrying an operation leads to wasted work on each try. In aggregate, this can create significant additional load on the user instance.
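The two points above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the operation always costs a fixed amount of time, and the function names (`attempt`, `run_with_retries`) are hypothetical helpers, not part of any Cloud Spanner client library.

```python
from typing import Tuple


def attempt(operation_seconds: float, deadline_seconds: float) -> Tuple[bool, float]:
    """Simulate one attempt; return (succeeded, seconds of work spent)."""
    if operation_seconds <= deadline_seconds:
        return True, operation_seconds
    # The deadline fires first, so the work done so far is thrown away.
    return False, deadline_seconds


def run_with_retries(
    operation_seconds: float, deadline_seconds: float, max_attempts: int
) -> Tuple[bool, int, float]:
    """Retry a fixed-cost operation; return (succeeded, attempts, total work)."""
    total_work = 0.0
    for attempt_number in range(1, max_attempts + 1):
        ok, work = attempt(operation_seconds, deadline_seconds)
        total_work += work
        if ok:
            return True, attempt_number, total_work
    return False, max_attempts, total_work


# A 2-second operation with a 1-second deadline never succeeds,
# no matter how many times it is retried.
short = run_with_retries(operation_seconds=2.0, deadline_seconds=1.0, max_attempts=10)

# A deadline that covers the operation succeeds on the first attempt.
generous = run_with_retries(operation_seconds=2.0, deadline_seconds=5.0, max_attempts=10)
```

In this model the short deadline spends ten seconds of work without producing a result, while the longer deadline completes after two seconds of work.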

What is a DEADLINE_EXCEEDED error?

When users use one of the Cloud Spanner client libraries, the underlying gRPC layer takes care of communication, marshaling, unmarshalling, and deadline enforcement. Deadlines allow the user application to specify how long it is willing to wait for a request to complete before the request is terminated with the error DEADLINE_EXCEEDED.

The following guide demonstrates how users can specify deadlines (or timeouts) in each of the supported Cloud Spanner client libraries. The Cloud Spanner client libraries use default timeout and retry policy settings which are defined in the following configuration files:

Users can learn more about gRPC deadlines here.
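As a rough illustration of a per-request deadline, the sketch below assumes the Python client library, where `database` is a `google.cloud.spanner` Database object and `execute_sql` accepts a `timeout` argument in seconds. The helper name `read_with_deadline` is hypothetical; the guide referenced above remains the authoritative source for each language.

```python
def read_with_deadline(database, sql, timeout_seconds):
    """Run a read-only query, waiting at most `timeout_seconds` for a response.

    Assumption: `database` behaves like a google.cloud.spanner Database,
    for example spanner.Client().instance(...).database(...).
    """
    with database.snapshot() as snapshot:
        # `timeout` is the deadline for this call, in seconds. If it elapses
        # before a response arrives, the call fails with DEADLINE_EXCEEDED.
        return list(snapshot.execute_sql(sql, timeout=timeout_seconds))
```

The value passed as `timeout_seconds` should reflect how long a response is still useful to the application, in line with the philosophy described earlier.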

Investigating common causes of Deadline Exceeded errors and their resolutions

Cloud Spanner Instance

A Cloud Spanner instance must be appropriately configured for the user’s specific workload. The following sections describe how to identify configuration issues and resolve them.

Check the Cloud Spanner Instance CPU Load

Request latency can increase significantly as CPU utilization crosses the recommended healthy threshold. Users can check the Spanner CPU utilization in the monitoring console provided in the Cloud Console. Alerts can also be created based on the instance’s CPU utilization.

Resolution

The following guide provides steps to help users reduce the instance’s CPU utilization. 

Check the Request’s End-to-end Latency Breakdown

As a request travels from the client to Cloud Spanner servers and back, it makes several network hops: from the client library to the Google Front End, from the Google Front End to the Cloud Spanner API Front End, and finally from the Cloud Spanner API Front End to the Cloud Spanner database. If there are network issues at any of these stages, users may see Deadline Exceeded errors.

It is possible to capture the latency at each stage (see the latency guide). Users can learn more using the following guide on how to diagnose latency issues.

Resolution

With the latency breakdown in hand, users can follow the decision guide in Troubleshoot latency issues.

Data API

Certain non-optimal usage patterns of Cloud Spanner’s data API may result in Deadline Exceeded errors. The next sections provide guidelines on how to check for that.

Check for Expensive Queries

Users might be trying to execute expensive queries that do not fit the configured deadline in the client libraries. Some examples include, but are not limited to, full scans of a large table, cross-joins over several large tables, or a query with a predicate over a non-key column (which also results in a full table scan).

Users can inspect expensive queries using the Query Statistics table and the Transaction Statistics table. These tables show information about slow running queries / transactions, such as the average number of rows read, the average bytes read, the average number of rows scanned and more. Moreover, users can generate Query Execution Plans to further inspect how their queries are being executed.
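As a starting point, the sketch below builds a query against the hourly Query Statistics table (`spanner_sys.query_stats_top_hour`) that lists the slowest queries in the most recent interval. The helper name `slowest_queries_sql` and the choice of ordering by average latency are illustrative assumptions; the statement can be run from any client or from the Cloud Console.

```python
def slowest_queries_sql(limit: int = 10) -> str:
    """Build SQL listing the slowest queries in the latest one-hour interval."""
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    # Only the most recent interval is inspected; older intervals can be
    # examined by filtering on interval_end explicitly.
    return f"""
SELECT text,
       execution_count,
       avg_latency_seconds,
       avg_rows,
       avg_bytes,
       avg_rows_scanned,
       avg_cpu_seconds
FROM spanner_sys.query_stats_top_hour
WHERE interval_end =
      (SELECT MAX(interval_end) FROM spanner_sys.query_stats_top_hour)
ORDER BY avg_latency_seconds DESC
LIMIT {limit}
""".strip()


# Example: SQL text for the five slowest queries.
top_five_sql = slowest_queries_sql(5)
```

Queries with a high `avg_rows_scanned` relative to `avg_rows` are often good candidates for a closer look at their execution plan.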

Resolution

The following guide provides best practices for SQL queries. Users can use the data obtained through the above mentioned statistics tables and execution plans to optimize their queries and make schema changes to their databases. This may help reduce the execution time of the statements, potentially getting rid of deadline exceeded errors.

Check for Lock Contention

Spanner transactions need to acquire locks to commit. Applications running at high throughput may cause transactions to compete for the same resources, causing an increased wait to obtain the locks, impacting overall performance. This could result in exceeded deadlines for any read or write requests.

Resolution

Use read-only transactions for plain reads to avoid lock conflicts with writes, for example when reading all songs for a given album to display them on the Albums webpage. Read-write transactions should be reserved for writes or mixed read/write workflows. Users can find the root cause of high-latency read-write transactions using the Lock Statistics table and the following blogpost. Within this table, users can see the row keys with the highest lock wait times. This Troubleshooting guide explains how to find the transactions that access the columns involved in lock conflicts, and the following guide provides best practices for reducing lock contention. Applying them should improve overall transaction latency and reduce Deadline Exceeded errors.
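The sketch below combines both ideas: it reads the hourly Lock Statistics table (`spanner_sys.lock_stats_top_hour`) and does so inside a read-only snapshot, which is the same pattern recommended above for plain reads. It assumes the Python client library, where `database.snapshot()` yields a read-only transaction; the helper names are hypothetical.

```python
def top_lock_waits_sql(limit: int = 10) -> str:
    """Build SQL listing the row ranges with the highest lock wait times."""
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT CAST(row_range_start_key AS STRING) AS row_range_start_key, "
        "lock_wait_seconds, sample_lock_requests "
        "FROM spanner_sys.lock_stats_top_hour "
        "WHERE interval_end = "
        "(SELECT MAX(interval_end) FROM spanner_sys.lock_stats_top_hour) "
        "ORDER BY lock_wait_seconds DESC "
        f"LIMIT {limit}"
    )


def top_lock_waits(database, limit: int = 10):
    """Run the lock statistics query in a read-only transaction.

    Assumption: `database` behaves like a google.cloud.spanner Database.
    A snapshot does not take locks, so it cannot conflict with writers.
    """
    with database.snapshot() as snapshot:
        return list(snapshot.execute_sql(top_lock_waits_sql(limit)))
```

Application reads that only display data, such as listing the songs of an album, can follow the same `database.snapshot()` pattern instead of opening a read-write transaction.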

Check for Unoptimized Schemas

Users should consider which queries are going to be executed in Cloud Spanner in order to design an optimal schema. Sub-optimal schemas may result in performance issues for some queries. The penalty might be big enough that it prevents requests from completing within the configured deadline.

Resolution

The optimal schema design depends on the reads and writes being made to the database. The Schema design best practices and SQL best practices guides should be followed regardless of schema specifics. By following them, users can avoid the most common schema design issues. Other root causes of poor performance include the choice of primary keys, the table layout (using interleaved tables for faster access), a schema that is not optimized for performance, and the performance limits of the nodes configured in the user’s instance (regional limits, multi-regional limits).

Check for Hot Spots

Because Cloud Spanner is a distributed database, the schema design needs to account for preventing hot spots (see schema design best practices). For instance, using a monotonically increasing column as the first part of a key sends all new rows to the end of the key space, which limits the number of splits that Spanner can use to distribute the workload evenly. These bottlenecks can result in timeouts.
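One way to avoid a monotonically increasing key is to bit-reverse sequential values before using them as keys, so that consecutive IDs land far apart in the key space. The sketch below shows the idea for signed 64-bit integers; the function name `bit_reverse_int64` is hypothetical, and other approaches (such as UUIDs or hashing the key) may fit a given schema better.

```python
def bit_reverse_int64(value: int) -> int:
    """Reverse the 64 bits of a non-negative integer and return a signed INT64.

    Sequential inputs (1, 2, 3, ...) map to widely separated outputs, so
    inserts are spread across the key space instead of piling up at the end.
    """
    if not 0 <= value < 2**63:
        raise ValueError("value must fit in a non-negative signed 64-bit integer")
    # Write the value as 64 binary digits, reverse them, and parse the result.
    reversed_unsigned = int(format(value, "064b")[::-1], 2)
    # Reinterpret the unsigned result as a two's-complement signed value.
    if reversed_unsigned >= 2**63:
        return reversed_unsigned - 2**64
    return reversed_unsigned


# Keys for three consecutive sequence values.
keys = [bit_reverse_int64(n) for n in (1, 2, 3)]
```

Here the inputs 1, 2 and 3 become -2^63, 2^62 and -2^62, which fall in different regions of the INT64 range.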

Resolution

Resolving the issues described in the Unoptimized Schemas resolution above is a good first step. Users can also prevent hot spots by following the Best Practices guide. Finally, users can use the Key Visualizer to troubleshoot performance issues caused by hot spots.

Check for Misconfigured Timeouts

The client libraries provide reasonable defaults for all requests in Cloud Spanner. However, these might need to be adjusted for the user’s specific workload. It is worth observing the cost of user queries and adjusting the deadlines to suit the specific use case.

Resolution

The default settings for timeouts are suitable for most use cases. Users can override these configurations (as shown in Custom timeout and retry guide), but it is not recommended for users to use more aggressive timeouts than the default ones. If a user application has configured timeouts, it is recommended to either use the defaults or experiment with larger configured timeouts.
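The advice above can be turned into a simple rule of thumb: derive a candidate deadline from observed latencies, but never go below the library default. The sketch below is one possible heuristic, not an official recommendation; the function name, the percentile and the headroom factor are all assumptions chosen for illustration.

```python
import math
from typing import Sequence


def suggest_deadline_seconds(
    observed_latencies: Sequence[float],
    default_deadline: float,
    percentile: float = 99.0,
    headroom: float = 2.0,
) -> float:
    """Suggest a deadline that is never more aggressive than the default.

    The candidate is a high percentile of the observed latencies (nearest-rank
    method) multiplied by a headroom factor. The larger of the candidate and
    `default_deadline` is returned.
    """
    if not observed_latencies:
        return default_deadline
    ordered = sorted(observed_latencies)
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    candidate = ordered[rank - 1] * headroom
    return max(default_deadline, candidate)


# Fast queries: the default already covers them, so it is kept.
fast = suggest_deadline_seconds([0.1, 0.2, 0.3], default_deadline=30.0)

# Slow queries: the suggestion grows beyond the default.
slow = suggest_deadline_seconds([10.0, 20.0, 40.0], default_deadline=30.0)
```

With these inputs, `fast` stays at the 30-second default and `slow` becomes 80 seconds (the 40-second maximum times the headroom factor of 2).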

Admin API

Admin requests are expensive operations when compared to the Data API. Requests like CreateInstance, CreateDatabase or CreateBackup can take many seconds before returning. Our client libraries have high deadlines (60 minutes for both instance and database) for admin requests. This is to ensure the server has the opportunity to complete the request without clients having to retry/fail.

Admin operations can also take a long time because of background work that Cloud Spanner needs to do. For instance, when creating a secondary index on an existing table with data, Cloud Spanner needs to backfill index entries for the existing rows.
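In the client libraries, schema changes such as creating an index are long-running operations that the caller waits on. The sketch below assumes the Python client library, where `database.update_ddl` returns an operation whose `result` method accepts a timeout in seconds. The helper name and the 3600-second default (mirroring the 60-minute admin deadline mentioned above) are illustrative choices.

```python
def apply_ddl_and_wait(database, ddl_statement, timeout_seconds=3600):
    """Submit a schema change and wait for the long-running operation.

    Assumption: `database` behaves like a google.cloud.spanner Database.
    The timeout only bounds how long this client waits; work such as an
    index backfill continues on the server side.
    """
    operation = database.update_ddl([ddl_statement])
    # Block until the operation finishes or the timeout elapses.
    operation.result(timeout=timeout_seconds)
    return operation
```

A call might look like `apply_ddl_and_wait(database, "CREATE INDEX SongsBySongName ON Songs(SongName)")`, where the index and table names are placeholders.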

Resolution

If customers are experiencing Deadline Exceeded errors while using the Admin API, it is recommended to observe the Cloud Spanner Instance CPU Load. Users need to make sure the instance is not overloaded in order to complete the admin operations as fast as possible.

Cloud Console

Queries issued from the Cloud Console query page may not exceed 5 minutes. If the user creates an expensive query that goes beyond this time, they will see an error message in the UI itself like so:

[Image: DEADLINE_EXCEEDED error message shown in the Cloud Console query page]

The failed queries will be canceled by the backend, possibly rolling back the transaction if necessary.

Resolution

Customers can rewrite the query using the best practices for SQL queries.

Dataflow

In Apache Beam, the default timeout configuration is 2 hours for read operations and 15 seconds for commit operations. This configuration is to allow for longer operations when compared to the standalone client library. However, it is still possible to get timeouts when the work items are too large. Currently, it is only possible to customize the commit timeout configuration if necessary.

Resolution

If a Deadline Exceeded error is occurring in the steps ReadFromSpanner / Execute query / Read from Cloud Spanner / Read from Partitions, it is recommended to check the query statistics table to find out which query scanned a large number of rows. The user can then modify such queries to try and reduce the execution time.

The user can also see an error such as this example exception:

  exception:
     org.apache.beam.sdk.util.UserCodeException:
     com.google.cloud.spanner.SpannerException: DEADLINE_EXCEEDED:
     io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after
     3599.999905380s.
     [remote_addr=batch-spanner.googleapis.com/172.217.5.234:443] at
     org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:184)

These timeouts are caused by work items being too large. In the above case, the following two recommendations may help. First, the user can try enabling the shuffle service if it is not yet enabled. Second, the user can try tuning the Spanner Read configuration, such as maxPartitions and partitionSizeBytes (more information here), to reduce the work item size. An example of how to do this can be found here.

Additional Troubleshooting Resources

If customers still see Deadline Exceeded errors after following the steps above, the breakdown of the end-to-end latency will help determine whether they need to open a support case (see the full list in Troubleshoot latency issues):

  • If customers see a high Google Front End latency, but low Cloud Spanner API request latency, customers should open a support ticket.

  • If customers see a high Cloud Spanner API request latency, but a low query latency, customers should open a support ticket.
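The two rules above can be expressed as a small decision helper. This is only a sketch: the function name and the single `high_threshold_ms` cut-off for what counts as "high" latency are assumptions made for illustration, and the right threshold depends on the workload.

```python
def should_open_support_case(
    gfe_latency_ms: float,
    api_request_latency_ms: float,
    query_latency_ms: float,
    high_threshold_ms: float = 1000.0,
) -> bool:
    """Apply the two escalation rules to a latency breakdown."""
    gfe_high = gfe_latency_ms >= high_threshold_ms
    api_high = api_request_latency_ms >= high_threshold_ms
    query_high = query_latency_ms >= high_threshold_ms

    # Rule 1: high Google Front End latency, low Cloud Spanner API latency.
    if gfe_high and not api_high:
        return True
    # Rule 2: high Cloud Spanner API latency, low query latency.
    if api_high and not query_high:
        return True
    # Otherwise the latency is explained by the query itself (or nothing is
    # high), and the earlier sections of this guide apply.
    return False


# High GFE latency with a fast API request: escalate.
case_gfe = should_open_support_case(2500.0, 40.0, 30.0)

# Everything is high because the query itself is slow: optimize the query.
case_query = should_open_support_case(2500.0, 2400.0, 2300.0)
```

In the second example, the helper returns `False` because the slow query accounts for the latency seen at the outer layers.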

Customers can also use the following additional resources: