Reliability: Read data

This document provides an in-depth look at how reliability considerations apply to reading data out of BigQuery's managed storage system. Reading data out might mean fetching the results of a query or making an external copy of the data for archival purposes. You might also need to process data outside BigQuery, for example with a Dataflow job. Listing rows, streaming rows, exporting tables and copying tables are all examples of reading data out of BigQuery's managed storage system.

Choose a read method

There are four methods of reading data from BigQuery's managed storage outside the context of running queries:

  • BigQuery API: Synchronous paginated access with the tabledata.list method. Data is read in a serial fashion, one page per invocation.
  • BigQuery Storage API: Streaming high-throughput access that also supports server-side column projection and filtering. Reads can be parallelized across many readers by segmenting them into multiple disjoint streams.
  • Export: Asynchronous high-throughput copying to Google Cloud Storage, either with extract jobs or the EXPORT DATA statement.
  • Copy: Asynchronous copying of datasets within BigQuery. The copy is done logically when the source and destination locations are the same.

Which method to use depends on your application requirements:

  • Read versus Copy: If you need an at-rest copy in Google Cloud Storage, export the data either with an extract job or an EXPORT DATA statement. If you only want to read the data, the BigQuery Storage API is a better choice. Finally, if you want to make a copy within BigQuery, then you should use a copy job.
  • Scale: The BigQuery API is the least efficient method and should not be used for high volume reads. If you need to export more than 50 TB of data per day, use the EXPORT DATA statement or the BigQuery Storage API.
  • Time to First Row: The BigQuery API is the fastest to return the first row, but should only be used to read small amounts of data. The BigQuery Storage API is a little slower to return the first row, but has much higher throughput. Exports and copies must finish before any rows can be read, so the time to the first row for these types of jobs can be on the order of minutes.

Read with BigQuery API

Data Consistency

Reading rows with the jobs.getQueryResults and tabledata.list methods is done in paginated fashion. Pages are typically 10 MB in size, but can be as large as 100 MB to fulfill the requested maxResults rows.

If multiple pages are needed, the response includes a pageToken that can be used to fetch the next page. To guarantee exactly-once row semantics, fetch subsequent pages with the pageToken parameter; it is also more performant than using startIndex.
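A minimal sketch of this pagination pattern, with a hypothetical fetch_page callable standing in for a tabledata.list call (in practice the call goes through an authenticated BigQuery client):

```python
def read_all_rows(fetch_page):
    """Read every row exactly once by following pageToken.

    fetch_page is a hypothetical callable standing in for
    tabledata.list: it takes a page token (None for the first page)
    and returns a response dict with "rows" and, when more pages
    remain, "pageToken".
    """
    rows = []
    token = None
    while True:
        response = fetch_page(token)
        rows.extend(response.get("rows", []))
        token = response.get("pageToken")
        if token is None:  # last page: no token in the response
            return rows
```

Driving the loop off the returned token, rather than a row offset, is what preserves exactly-once semantics across pages.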

Data Locality

Data locality isn't typically a consideration for the BigQuery API given that it is relatively low throughput and should only be used to read small amounts of data.

Manage Quotas and Limits

There are per-project quotas for requests, rows and bytes. While the defaults can all be increased upon request, the need to do so may indicate that one of the other read methods is more suited to your use case.

You can see your quota consumption for the BigQuery API on the Quotas page under IAM and Admin in the Google Cloud console.

Read with BigQuery Storage Read API

Data Consistency

Storage Read API sessions read based on a snapshot isolation model: all consumers of a session read data as of a specific point in time. The default snapshot time is the session creation time, but consumers may read data from an earlier point in time, just as they can with queries.
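As an illustration, the snapshot time can be pinned when the session is created by populating the read session's table_modifiers.snapshot_time field. The sketch below builds such a request body as a plain dictionary; the field names mirror the v1 ReadSession message, and in practice the request would be passed to an authenticated BigQuery Storage client:

```python
from datetime import datetime, timedelta, timezone

def snapshot_time(minutes_ago):
    """Return an RFC 3339 timestamp for an earlier point in time."""
    t = datetime.now(timezone.utc) - timedelta(minutes=minutes_ago)
    return t.isoformat(timespec="seconds").replace("+00:00", "Z")

def build_session_request(project, table_path, minutes_ago):
    """Build a read-session request body pinned to an earlier snapshot.

    Hypothetical plain-dict shape; a real client library uses typed
    request objects with the same field names.
    """
    return {
        "parent": f"projects/{project}",
        "read_session": {
            "table": table_path,
            "data_format": "AVRO",
            # Read the table as it existed minutes_ago minutes in the past.
            "table_modifiers": {"snapshot_time": snapshot_time(minutes_ago)},
        },
        "max_stream_count": 4,  # allow up to 4 parallel disjoint streams
    }
```

Note that the snapshot time must fall within the table's time travel window.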

Data Locality

Data locality impacts both the peak throughput and the consistency of performance. Users who need consistently high throughput with the BigQuery Storage API should therefore use BigQuery regional locations rather than BigQuery multi-regional locations. This is the only way to ensure data locality.

The issue with BigQuery multi-regions is that they do not expose nor guarantee the placement of data within the scope of the multi-region. In fact, BigQuery may change the underlying storage region without notice. This could result in a drop in performance if the data happened to be co-located before the change.

Manage Quotas and Limits

There are per-minute, per-user, per-project quotas for API requests.

You can see your quota consumption for the BigQuery Storage Read API on the Quotas page under IAM and Admin in the Google Cloud console.

Copy to Google Cloud Storage

Copying data from BigQuery storage to Google Cloud Storage can be done either with an extract job or with an EXPORT DATA statement. The semantics and reliability considerations for these two methods are the same.

Data Consistency

Data is exported based on a snapshot isolation model. The result is atomic. If the job succeeds, then the Cloud Storage object exists and contains all rows. If the job fails, then the Cloud Storage object is not created.

Data Locality

When exporting data, the destination Google Cloud Storage bucket must be located in the same locational scope as the BigQuery dataset. For example, when exporting from an EU multi-regional dataset, the destination must either be the EU multi-region or a location within the EU.

For legacy reasons the US multi-region is allowed to export to any bucket, but for consistency of performance it is still recommended to stay within the continent.

Manage Quotas and Limits

Export jobs are subject to quotas and limits. If you need more than 100,000 export jobs per day or more than 50 TB per day, consider using the BigQuery Storage API instead.

Usage for export jobs is available in INFORMATION_SCHEMA. Specifically, the job entry in the JOBS_BY_* system tables for an export job contains a total_bytes_processed value that can be used to monitor aggregate usage and ensure that it stays below 50 TB per day.

SELECT
    DATE(creation_time) AS day,
    project_id AS source_project_id,
    SUM(total_bytes_processed) AS total_bytes_processed
FROM
    `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
    creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 8 DAY) AND CURRENT_TIMESTAMP()
    AND job_type = "EXTRACT"
GROUP BY
    day,
    source_project_id
ORDER BY
    day DESC
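The daily totals returned by this query can then be checked against the 50 TB per-day limit. A small sketch, assuming each result row has been fetched as a (day, project, total_bytes_processed) tuple:

```python
# 50 TB per project per day, expressed in bytes (decimal TB).
DAILY_EXPORT_LIMIT_BYTES = 50 * 10**12

def days_over_limit(rows):
    """Return the (day, project, bytes) rows whose exported bytes
    exceed the daily export limit."""
    return [row for row in rows if row[2] > DAILY_EXPORT_LIMIT_BYTES]
```

Rows flagged by such a check indicate days where switching to the BigQuery Storage API would have been warranted.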

Copy within BigQuery

Making a copy of a dataset or table within BigQuery can be done with a copy job.

Data Locality

Copies within a location are done logically. While they usually complete quickly, BigQuery provides no latency guarantees on the asynchronous operation.

Copying datasets between locations requires an actual copy of all the underlying data. Therefore the run time of such copy jobs will be more variable. Also note that copying datasets between locations has not yet reached General Availability. The alternative is to export to a Google Cloud Storage bucket in the source location, copy the Cloud Storage data to the destination location, and then run an import job in the destination location.

Manage Quotas and Limits

Copy jobs are subject to per-project and per-table quotas and limits. While the per-destination-table limits cannot be increased, the per-project per-day quota can be increased upon request.

Usage for copy jobs is available in INFORMATION_SCHEMA.

SELECT
    DATE(creation_time) AS day,
    project_id AS source_project_id,
    CONCAT(destination_table.project_id, ":", destination_table.dataset_id, ".", destination_table.table_id) AS destination_table,
    COUNT(job_id) AS copy_job_count
FROM
    `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
    creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 8 DAY) AND CURRENT_TIMESTAMP()
    AND job_type = "COPY"
GROUP BY
    day,
    source_project_id,
    destination_table
ORDER BY
    day DESC

Use cases for reading data

This is a continuation of the use cases introduced previously in this series.

Real-time analytics

In this use case, streaming data from endpoint logs is inserted continuously. A daily snapshot is to be exported to Google Cloud Storage for backup and archival purposes. An extract job, subject to the quotas and limits discussed above, is the best choice.

Submit an extract job with the API or client libraries, passing in a unique ID as jobReference.jobId. Extract jobs are asynchronous: check the job status using the unique job ID used to create the job. The job completed successfully if status.state is DONE. If status.errorResult is present, the job failed and needs to be retried.
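A minimal sketch of this submit-and-poll pattern. The submit_job and get_job callables are hypothetical stand-ins for the jobs.insert and jobs.get API methods as exposed by an authenticated client:

```python
import time
import uuid

def run_extract(submit_job, get_job, poll_seconds=2.0):
    """Submit an extract job with a unique ID and poll until it finishes.

    submit_job(job_id) is a hypothetical wrapper around jobs.insert;
    get_job(job_id) wraps jobs.get and returns the job's status dict,
    e.g. {"state": "DONE"} or {"state": "DONE", "errorResult": {...}}.
    """
    # A caller-supplied unique ID makes a retried insert idempotent:
    # re-submitting after a timeout cannot create a duplicate job.
    job_id = f"extract-{uuid.uuid4()}"
    submit_job(job_id)
    while True:
        status = get_job(job_id)
        if status["state"] == "DONE":
            if "errorResult" in status:
                raise RuntimeError(f"extract job {job_id} failed: {status['errorResult']}")
            return job_id
        time.sleep(poll_seconds)
```

A failed job should be retried with a new unique job ID.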

Data locality is a requirement: the Google Cloud Storage bucket needs to be located in the same region or multi-region as the BigQuery dataset being exported. It is good practice to store archival and backup data in a different region from the live data, to be able to recover from backup in case of a region loss. To facilitate this, data should be copied or moved to a Cloud Storage bucket in a different region after being exported from BigQuery.

Batch data processing

In this use case, a nightly batch job is used to load data by a fixed deadline. Once this load job completes, a table with statistics is materialized from a query as described in the previous section. Data from this table needs to be retrieved and compiled into a PDF report to be sent to a regulator.

Since the amount of data that needs to be read is small, the BigQuery API is a reasonable choice: use the tabledata.list API to retrieve all rows of the table in JSON dictionary format. If there is more than one page of data, the results will have the pageToken property set. To retrieve the next page of results, make another tabledata.list call and include the token value as the pageToken parameter. If the API call fails with a 5xx error, retry with exponential backoff. Most 4xx class errors are not retryable. For better decoupling of BigQuery export and report generation, results should be persisted to disk.
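The retry guidance above can be sketched as a generic wrapper with exponential backoff. ServerError here is a hypothetical stand-in for an HTTP 5xx response from the tabledata.list call:

```python
import random
import time

class ServerError(Exception):
    """Stand-in for an HTTP 5xx response from the API."""

def call_with_backoff(call, max_attempts=5, base_delay=0.5):
    """Retry a callable on transient server errors.

    Only ServerError (our 5xx stand-in) is retried; 4xx-class errors
    propagate immediately because most are not retryable.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except ServerError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # Exponential backoff with jitter: 0.5s, 1s, 2s, ... plus noise.
            time.sleep(base_delay * 2**attempt + random.uniform(0, 0.1))
```

Persisting each retrieved page to disk before generating the report keeps a transient read failure from forcing the whole report run to start over.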

Next steps