Introduction to reliability
This guide helps you understand BigQuery's reliability characteristics, such as availability, durability, data consistency, consistency of performance, and data recovery, along with a review of error handling considerations.
This introduction helps you address three primary considerations:
- Determine whether BigQuery is the right tool for your job.
- Understand the dimensions of BigQuery reliability.
- Identify specific reliability requirements for specific use cases.
BigQuery is a fully managed enterprise data warehouse built to store and analyze massive datasets. It provides a way to ingest, store, read, and query megabytes to petabytes of data with consistent performance without having to manage any of the underlying infrastructure. Because of its power and performance, BigQuery is well suited to a range of solutions. Some of these are documented in detail as reference patterns.
Generally, BigQuery is very well suited for workloads where large amounts of data are being ingested and analyzed. Specifically, it can be effectively deployed for use cases such as real-time and predictive data analytics (with streaming ingestion and BigQuery ML), anomaly detection, and other use cases where analyzing large volumes of data with predictable performance is key. However, if you are looking for a database to support Online Transaction Processing (OLTP) style applications, you should consider other Google Cloud services such as Cloud Spanner, Cloud SQL, or Cloud Bigtable that may be better suited for these use cases.
Dimensions of reliability in BigQuery
Availability
Availability defines the user's ability to read data from BigQuery or write data to it. BigQuery is built to make both of these highly available with a 99.99% SLA. There are two components involved in both operations:
- The BigQuery service
- Compute resources required to execute the specific query
Reliability of the service is a function of the specific BigQuery API being used to retrieve the data. The availability of compute resources depends on the capacity available to the user at the time when the query is run. See Understand slots for more information about the fundamental unit of compute for BigQuery and the resulting slot resource economy. For more information, see Reliability: Query data.
Durability
Durability is discussed in the Implementing SLOs chapter of the SRE Workbook and is described as the "proportion of data that can be successfully read."
For more information about disaster preparedness, see Reliability: Disaster recovery.
Data consistency
Consistency describes users' expectations for how data can be queried once it's written or modified. One aspect of data consistency is ensuring "exactly-once" semantics for data ingestion. For more information, see Reliability: Import data and Retry failed job insertions.
Consistency of performance
In general, performance can be expressed in two dimensions. Latency is a measure of the execution time of long data retrieval operations such as queries. Throughput is a measure of how much data BigQuery can process during a specific period of time. Due to BigQuery's multi-tenant, horizontally scalable design, its throughput can scale up to arbitrary data sizes. The relative importance of latency and throughput is determined by the specific use case. For more information, see Reliability: Query data.
Data recovery
Two ways to measure the ability to recover data after an outage are:
- Recovery Time Objective (RTO). How long data can be unavailable after an incident.
- Recovery Point Objective (RPO). How much of the data collected prior to the incident can acceptably be lost.
These considerations are specifically relevant in the unlikely case that a zone or region experiences a multi-day or destructive outage. For more information on how to plan and prepare for such events, see Reliability: Disaster recovery.
Error handling
The following are best practices for addressing errors that affect reliability.
Retry failed API requests
Clients of BigQuery, including client libraries and partner tools, should use truncated exponential backoff when issuing API requests. This means that if a client receives a system error or a quota error, it should retry the request up to a certain number of times, but with a random and increasing backoff interval.
Employing this method of retries makes your application much more robust in the face of errors. Even under normal operating conditions, you can expect on the order of one in ten thousand requests to fail, as described in BigQuery's 99.99% availability SLA. Under abnormal conditions this error rate may increase, but if errors are randomly distributed, the strategy of exponential backoff can mitigate all but the most severe cases.
If you encounter a scenario where a request fails persistently with a 5XX error, then you should escalate to Google Cloud Support. Be sure to clearly communicate the impact the failure is having on your business so that the issue can be triaged correctly. If, on the other hand, a request persistently fails with a 4XX error, the problem should be addressable by changes to your application. Read the error message for details.
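A minimal sketch of this triage logic, deciding which responses are worth retrying with backoff. The specific status codes and error reasons listed are common retryable cases; the function itself is illustrative, not part of any BigQuery client library:

```python
def should_retry(status_code, reason=None):
    """Decide whether a failed BigQuery API response is worth retrying.

    Transient server errors (5xx) and quota or rate-limit errors are
    retryable with exponential backoff; other 4xx errors indicate a
    request the application must change, so retrying wastes quota.
    """
    if 500 <= status_code <= 599:   # system error: retry with backoff
        return True
    if status_code == 429:          # rate limited: retry with backoff
        return True
    # BigQuery reports some quota errors as 403 with a specific reason.
    if status_code == 403 and reason in ("rateLimitExceeded", "quotaExceeded"):
        return True
    return False                    # other 4xx: fix the request instead
```

A persistent `True` result from such a check (a 5xx that never clears) is the case to escalate to support; a persistent `False` points to an application-side fix.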
Exponential backoff logic example
Exponential backoff logic retries a query or request by increasing the wait time between retries up to a maximum backoff time. For example:
1. Make a request to BigQuery.
2. If the request fails, wait 1 + random_number_milliseconds seconds and retry the request.
3. If the request fails, wait 2 + random_number_milliseconds seconds and retry the request.
4. If the request fails, wait 4 + random_number_milliseconds seconds and retry the request.
5. And so on, up to a maximum_backoff time.
6. Continue to wait and retry up to a maximum number of retries, but don't increase the wait period between retries.

Note the following:

- The wait time is min(((2^n)+random_number_milliseconds), maximum_backoff), with n incremented by 1 for each iteration (request).
- random_number_milliseconds is a random number of milliseconds less than or equal to 1000. This randomization helps to avoid situations where many clients are synchronized and all retry simultaneously, sending requests in synchronized waves. The value of random_number_milliseconds is recalculated after each retry request.
- The maximum backoff interval (maximum_backoff) is typically 32 or 64 seconds. The appropriate value for maximum_backoff depends on the use case.
- The client can continue retrying after it reaches the maximum backoff time. Retries after this point don't need to continue increasing the backoff time. For example, if the client uses a maximum backoff time of 64 seconds, then after reaching this value the client can continue to retry every 64 seconds. At some point, clients should be prevented from retrying indefinitely.
- The wait time between retries and the number of retries depend on your use case and network conditions.
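The backoff calculation above can be sketched in a few lines of Python. This is a minimal illustration of the formula, not a client-library implementation; the function names, retry limit, and the assumption that the request callable raises on retryable errors are all choices made for the example:

```python
import random
import time

def backoff_seconds(n, maximum_backoff=64.0):
    """Wait before retry n: min((2^n) + random_number_milliseconds, maximum_backoff)."""
    random_number_milliseconds = random.randint(0, 1000) / 1000.0
    return min((2 ** n) + random_number_milliseconds, maximum_backoff)

def call_with_backoff(make_request, max_retries=10, maximum_backoff=64.0):
    """Call make_request(), retrying with truncated exponential backoff.

    make_request is assumed to raise an exception on a retryable error
    (for example, a 5xx or quota error) and return normally on success.
    """
    for n in range(max_retries):
        try:
            return make_request()
        except Exception:
            if n == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(backoff_seconds(n, maximum_backoff))
```

The first waits are roughly 1, 2, and 4 seconds plus jitter, and once 2^n exceeds maximum_backoff the wait stays capped at that value.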
Retry failed job insertions
If exactly-once insertion semantics are important for your application, there are additional considerations when inserting jobs. How you achieve these semantics depends on which WriteDisposition you specify. The write disposition tells BigQuery what to do when it encounters existing data in a table: fail, overwrite, or append.
With a WRITE_TRUNCATE disposition, exactly-once semantics are achieved by simply retrying any failed job insertion or execution, because all rows ingested by a job are atomically written to the table.
With a WRITE_APPEND disposition, the client needs to specify the job ID to guard against a retry appending the same data a second time. This works because BigQuery rejects job creation requests that attempt to reuse an ID from a previous job, which achieves at-most-once semantics for any given job ID. You can then achieve exactly-once semantics by retrying under a new, predictable job ID once you've confirmed with BigQuery that all previous attempts failed.
In some cases, the API client or HTTP client might not receive confirmation that the job was inserted due to transient issues or network interruptions. When the insertion is retried, the request fails because a job with that ID already exists (reason="duplicate"). The existing job's status can be retrieved with a call to jobs.get, after which the caller can determine whether a new job with a new job ID should be created.
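This pattern can be sketched with deterministic, predictable job IDs. The ID scheme, the in-memory stand-in for the jobs API, and the helper names below are illustrative assumptions for the example, not the BigQuery client library:

```python
class DuplicateJobError(Exception):
    """Simulates BigQuery rejecting a job ID that was already used."""

def predictable_job_id(table, batch_date, attempt):
    # Deterministic ID: retrying the same attempt reuses the same ID,
    # so an accidental double-insert of the same job is rejected.
    return f"load_{table}_{batch_date}_{attempt}"

class FakeJobService:
    """Stand-in for the jobs API: rejects duplicate IDs like BigQuery does."""
    def __init__(self):
        self.jobs = {}  # job_id -> final state

    def insert(self, job_id, state="DONE"):
        if job_id in self.jobs:
            raise DuplicateJobError(job_id)  # analogous to reason="duplicate"
        self.jobs[job_id] = state
        return state

    def get(self, job_id):                   # analogous to jobs.get
        return self.jobs.get(job_id)

def insert_exactly_once(service, table, batch_date, max_attempts=3):
    """At-most-once per job ID; move to a new predictable job ID only
    after confirming the previous attempt failed."""
    for attempt in range(max_attempts):
        job_id = predictable_job_id(table, batch_date, attempt)
        try:
            service.insert(job_id)
            return job_id
        except DuplicateJobError:
            # A job with this ID already exists: check its status.
            if service.get(job_id) == "DONE":
                return job_id  # previous attempt succeeded; don't re-ingest
            # Previous attempt failed; continue with the next job ID.
    raise RuntimeError("all attempts failed")
```

If a retry hits the duplicate rejection, the status lookup distinguishes "the earlier insert actually succeeded" (return without re-ingesting) from "the earlier job failed" (move on to the next predictable ID).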
Use cases and reliability requirements
BigQuery can be a critical component of a variety of architectures. Depending on the use case and the architecture deployed, a variety of availability, performance, or other reliability requirements might need to be met. For the purposes of this guide, two primary use cases and architectures are discussed in detail.
Real-time data processing
The first example is an event data processing pipeline. In this example, log events from endpoints are ingested using Pub/Sub. From there, a streaming Dataflow pipeline performs some operations on the data prior to writing it into BigQuery using the Storage Write API. The data is then used both for ad hoc querying, for example to recreate sequences of events that may have resulted in specific endpoint outcomes, and for feeding near-real-time dashboards that allow the detection of trends and patterns in the data through visualization.
This example requires you to consider multiple aspects of reliability. Because the end-to-end data freshness requirements are strict, the latency of the ingestion process is critical. Once data is written to BigQuery, reliability is perceived as the ability to issue ad hoc queries with consistent, predictable latency, and as dashboards that reflect the latest available information.
Batch data processing
The second example is a batch processing architecture based around regulatory compliance in the financial services industry. A key requirement is to deliver daily reports to regulators by a fixed nightly deadline. As long as the nightly batch process that generates the reports completes by this deadline, it is considered sufficiently fast.
Data needs to be made available in BigQuery and joined with other data sources for dashboarding, analysis, and ultimately the generation of a PDF report. Delivering these reports on time and without error is a critical business requirement, so it is key to ensure the reliability of both data ingestion and the correct, timely production of the report to meet the required deadlines.
What's next
- Learn about disaster planning.