
Introduction to reliability

This guide provides an understanding of BigQuery reliability features.

The topics include the following:

  • An overview of reliability features - Insight into availability, durability, data consistency, consistency of performance, and data recovery in BigQuery, and a review of error handling considerations.
  • Importing data - An in-depth look at how these dimensions of reliability apply to importing data into BigQuery's managed storage system.
  • Querying data - How to reliably query data within your BigQuery environment.
  • Reading data - How reliability applies to reading data out of BigQuery's managed storage system.
  • Disaster recovery - How BigQuery handles failures ranging from the loss of an individual machine to the catastrophic loss of a region.

This introduction helps you address three primary considerations:

  • Determine whether BigQuery is the right tool for your job.
  • Understand the dimensions of BigQuery reliability.
  • Identify specific reliability requirements for specific use cases.

Selecting BigQuery

BigQuery is a fully managed enterprise data warehouse built to store and analyze massive datasets. It provides a way to ingest, store, read, and query megabytes to petabytes of data with consistent performance, without having to manage any of the underlying infrastructure. Because of its power and performance, BigQuery is well suited to a range of solutions. Some of these are documented in detail as reference patterns.

Generally, BigQuery is well suited for workloads where large amounts of data are being ingested and analyzed. Specifically, it can be effectively deployed for use cases such as real-time and predictive data analytics (with streaming ingestion and BigQuery ML), anomaly detection, and other use cases where analyzing large volumes of data with predictable performance is key. However, if you are looking for a database to support Online Transaction Processing (OLTP) style applications, consider other Google Cloud services, such as Cloud Spanner, Cloud SQL, or Cloud Bigtable, which may be better suited for these use cases.

Dimensions of reliability in BigQuery

Availability

Availability defines the user's ability to read data from BigQuery or write data to it. BigQuery is built to make both of these highly available with a 99.99% SLA. There are two components involved in both operations:

  • The BigQuery service
  • Compute resources required to execute the specific query

Reliability of the service is a function of the specific BigQuery API being used to retrieve the data. The availability of compute resources depends on the capacity available to the user at the time when the query is run. See Understand slots for more information about the fundamental unit of compute for BigQuery and the resulting slot resource economy. For more information, see Reliability: Query data.

Durability

Durability is discussed in the Implementing SLOs chapter of the SRE Workbook and is described as the "proportion of data that can be successfully read." For more information about BigQuery's approach to data storage, see Storage management. To take additional steps to protect your data, spread your data across additional failure domains. For more information about disaster preparedness, see Reliability: Disaster recovery.

Data consistency

Consistency defines users' expectations for how data can be queried after it is written or modified. One aspect of data consistency is ensuring "exactly-once" semantics for data ingestion. For more information, see Reliability: Import data and Retrying failed job insertions.

Consistency of performance

In general, performance can be expressed in two dimensions. Latency is a measure of the execution time of long-running data retrieval operations such as queries. Throughput is a measure of how much data BigQuery can process during a specific period of time. Due to BigQuery's multi-tenant, horizontally scalable design, its throughput can scale up to arbitrary data sizes. The relative importance of latency and throughput is determined by the specific use case. For more information, see Reliability: Query data.

Data recovery

Two ways to measure the ability to recover data after an outage are:

  • Recovery Time Objective (RTO). How long data can be unavailable after an incident.
  • Recovery Point Objective (RPO). How much of the data collected prior to the incident can acceptably be lost.

These considerations are specifically relevant in the unlikely case that a zone or region experiences a multi-day or destructive outage. For details, see Availability and recovery. For more information on this and how to plan and prepare for such events, see Reliability: Disaster recovery.

Error handling

The following are best practices for addressing errors that affect reliability.

Retrying failed API requests

Clients of BigQuery, including client libraries and partner tools, should use truncated exponential backoff when issuing API requests. This means that if a client receives a system error or a quota error, it should retry the request up to a certain number of times, but with a random and increasing backoff interval.

Employing this method of retries will make your application much more robust in the face of errors. Even under normal operating conditions, you can expect on the order of one in ten thousand requests to fail as described in BigQuery's 99.99% availability SLA. Under abnormal conditions, this error rate may increase, but if errors are randomly distributed the strategy of exponential backoff can mitigate all but the most severe cases.
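The retry-with-backoff strategy described above can be sketched as a small generic helper. This is an illustrative sketch, not the client libraries' built-in mechanism (for example, the google-cloud-bigquery Python library already applies retries through its own retry objects); the function name and parameters here are hypothetical.

```python
import random
import time

def call_with_backoff(request_fn, max_attempts=5, base=0.5, cap=32.0,
                      retryable=(Exception,)):
    """Call request_fn, retrying with truncated exponential backoff.

    Hypothetical helper: `retryable` would typically match system (5xx)
    and quota errors rather than every exception.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Sleep a random interval in [0, min(cap, base * 2^attempt)].
            # The cap is the "truncated" part; the randomness (jitter)
            # keeps many clients from retrying in lockstep.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Each failed attempt roughly doubles the maximum wait, up to the cap, which is what spreads retry load out over time when many requests fail at once.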

If you encounter a scenario where a request fails persistently with a 5XX error, then you should escalate to Google Cloud Support. Be sure to clearly communicate the impact the failure is having on your business so that the issue can be triaged correctly. If, on the other hand, a request persistently fails with a 4XX error, the problem should be addressable by changes to your application. Read the error message for details.

Retrying failed job insertions

If exactly-once insertion semantics are important for your application, there are additional considerations when inserting jobs. How to achieve at-most-once semantics depends on which WriteDisposition you specify. The write disposition tells BigQuery what to do when it encounters existing data in a table: fail, overwrite, or append.

With a WRITE_EMPTY or WRITE_TRUNCATE disposition, this is achieved by retrying any failed job insertion or execution, because all rows ingested by a job are written to the table atomically.

With a WRITE_APPEND disposition, the client needs to specify the job ID to guard against a retry appending the same data a second time. This works because BigQuery will reject job creation requests that attempt to use an ID from a previous job. This achieves at-most-once semantics for any given job ID. You can then achieve exactly-once by retrying under a new predictable job ID once you've confirmed with BigQuery that all previous attempts have failed.

In some cases, the API client or HTTP client might not receive confirmation that the job was inserted due to transient issues or network interruptions. When the insertion is retried, the request fails with status=ALREADY_EXISTS (code=409 and reason="duplicate"). You can retrieve the status of the existing job with a call to jobs.get, and then determine whether a new job with a new job ID should be created.
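The protocol above can be sketched as follows. The `client` here is hypothetical, with `create_job` and `get_job` methods standing in for the jobs.insert and jobs.get API calls, and job creation assumed to run synchronously; a real implementation would use a BigQuery client library and poll running jobs.

```python
class Conflict(Exception):
    """A job with this ID already exists (HTTP 409, reason="duplicate")."""

def insert_append_job(client, job_id, config):
    """Insert a WRITE_APPEND job at most once for a given job ID.

    If the ID was already used -- for example, by an earlier attempt whose
    response was lost -- fetch that job's status instead of re-creating it.
    """
    try:
        return client.create_job(job_id, config)
    except Conflict:
        return client.get_job(job_id)

def append_exactly_once(client, base_id, config, max_attempts=3):
    """Retry failed attempts under new, predictable job IDs.

    Each attempt derives its job ID from base_id plus an attempt counter,
    so a retry can never silently duplicate a job that succeeded.
    """
    for attempt in range(max_attempts):
        job = insert_append_job(client, f"{base_id}-{attempt}", config)
        if job.state == "DONE" and not job.errors:
            return job  # the rows were appended exactly once
    raise RuntimeError(f"all {max_attempts} attempts for {base_id} failed")
```

The key design point is that job IDs are chosen by the client and deterministic per attempt: the Conflict branch gives at-most-once semantics for each ID, and advancing to a fresh ID only after confirming the previous attempt failed upgrades that to exactly-once.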

Use cases and reliability requirements

As discussed in the Selecting BigQuery section, BigQuery may be a critical component of a variety of architectures. Depending on the use case and architecture deployed, a variety of availability, performance, or other reliability requirements may need to be met. For the purposes of this guide, two primary use cases and architectures are discussed in detail.

Real-time analytics

The first example is an event data processing pipeline. In this example, log events from endpoints are ingested using Pub/Sub. From there, a streaming Dataflow pipeline performs some operations on the data prior to writing it into BigQuery using the Storage Write API. The data is then used both for ad hoc querying, for example to recreate sequences of events that may have resulted in specific endpoint outcomes, and for feeding near real-time dashboards to allow the detection of trends and patterns in the data through visualization.

This example requires you to consider multiple aspects of reliability. Because the end-to-end data freshness requirements are strict, the latency of the ingestion process is critical. After data is written to BigQuery, reliability means that users can issue ad hoc queries with consistent and predictable latency, and that dashboards using the data reflect the latest available information.

Batch data processing

The second example is a batch processing architecture based around regulatory compliance in the financial services industry. A key requirement is to deliver daily reports to regulators by a fixed nightly deadline. As long as the nightly batch process that generates the reports completes by this deadline, it is considered sufficiently fast.

Data needs to be made available in BigQuery and joined with other data sources for dashboarding, analysis, and ultimately the generation of a PDF report. Delivering these reports on time and without error is a critical business requirement. As such, it is key to ensure the reliability of both data ingestion and the correct, timely generation of the report.

Next steps