Reliability: Read data
This document provides an in-depth look at how reliability issues apply to reading data out of BigQuery's managed storage system. You might need to read data out to fetch the results of a query, to make an external copy of the data for archival purposes, or to do some data processing external to BigQuery, such as with a Dataflow job. Listing rows, streaming rows, exporting tables, and copying tables are all examples of reading data out of BigQuery's managed storage system.
Choose a read method
There are four methods of reading data from BigQuery's managed storage outside the context of running queries:
- BigQuery API: Synchronous paginated access with the tabledata.list method. Data is read in a serial fashion, one page per invocation.
- BigQuery Storage API: Streaming high-throughput access that also supports server-side column projection and filtering. Reads can be parallelized across many readers by segmenting them into multiple disjoint streams.
- Export: Asynchronous high-throughput copying to Google Cloud Storage, either with extract jobs or the EXPORT DATA statement.
- Copy: Asynchronous copying of datasets within BigQuery. The copy is done logically when the source and destination locations are the same.
Which method to use depends on your application requirements:
- Read versus Copy: If you need an at-rest copy in Google Cloud Storage, export the data either with an extract job or an EXPORT DATA statement. If you only want to read the data, the BigQuery Storage API is a better choice. Finally, if you want to make a copy within BigQuery, then you should use a copy job.
- Scale: The BigQuery API is the least efficient method and should not be used for high-volume reads. If you need to export more than 50 TB of data per day, use the EXPORT DATA statement or the BigQuery Storage API.
- Time to First Row: The BigQuery API is the fastest to return the first row, but should only be used to read small amounts of data. The BigQuery Storage API is a little slower to return the first row, but has much higher throughput. Exports and copies must finish before any rows can be read, so the time to the first row for these types of jobs can be on the order of minutes.
Read with BigQuery API
Data Consistency
Reading rows with the jobs.getQueryResults and tabledata.list methods is done in a paginated fashion. Pages are typically 10 MB in size, but can be as large as 100 MB to fulfill the requested maxResults rows. If multiple pages are needed, a pageToken is returned that can be used to fetch the next page. You must pass this token as the pageToken parameter to fetch the next page in order to guarantee exactly-once row semantics. The pageToken parameter is also more performant than using startIndex.
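As an illustration, here is a minimal Python sketch of paginated reads using the google-cloud-bigquery client library, which wraps tabledata.list; the table name is a placeholder.

from google.cloud import bigquery

# Placeholder table; substitute your own project, dataset, and table.
TABLE_ID = "my-project.my_dataset.my_table"

client = bigquery.Client()

# list_rows wraps tabledata.list; page_size maps to the maxResults parameter.
row_iterator = client.list_rows(TABLE_ID, page_size=500)

# Iterating page by page passes the returned token as the pageToken parameter
# of each subsequent call, preserving exactly-once row semantics.
for page in row_iterator.pages:
    for row in page:
        print(dict(row))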
Data Locality
Data locality isn't typically a consideration for the BigQuery API given that it is relatively low throughput and should only be used to read small amounts of data.
Manage Quotas and Limits
There are per-project quotas for requests, rows and bytes. While the defaults can all be increased upon request, the need to do so may indicate that one of the other read methods is more suited to your use case.
You can see your quota consumption for the BigQuery API on the Quotas page under IAM and Admin in the Google Cloud console.
Read with BigQuery Storage Read API
Data Consistency
Storage Read API sessions use a snapshot isolation model: all consumers of a session read data as of a specific point in time. The default snapshot time is the session creation time, but consumers may read data from an earlier point in time, just as they can with queries.
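As a rough sketch, assuming the google-cloud-bigquery-storage Python client library and placeholder project and table names, a read session can be pinned to an earlier snapshot time like this:

from datetime import datetime, timedelta, timezone

from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types
from google.protobuf import timestamp_pb2

# Placeholder identifiers; substitute your own project and table.
PROJECT = "my-project"
TABLE = "projects/my-project/datasets/my_dataset/tables/my_table"

client = bigquery_storage_v1.BigQueryReadClient()

# Read the table as of one hour ago instead of the session creation time.
snapshot_time = timestamp_pb2.Timestamp()
snapshot_time.FromDatetime(datetime.now(timezone.utc) - timedelta(hours=1))

requested_session = types.ReadSession(
    table=TABLE,
    data_format=types.DataFormat.AVRO,  # reading rows below requires the fastavro extra
    table_modifiers=types.ReadSession.TableModifiers(snapshot_time=snapshot_time),
)

session = client.create_read_session(
    parent=f"projects/{PROJECT}",
    read_session=requested_session,
    max_stream_count=4,  # reads can be parallelized across disjoint streams
)

# Every stream in the session sees the table as of the same snapshot time.
reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):
    print(dict(row))

Each stream returned in the session can be consumed by a separate worker; all of them observe the table as of the same snapshot time.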
Data Locality
Data locality affects both the peak throughput and the consistency of performance. Users who need consistent high throughput with the BigQuery Storage API should therefore use BigQuery regional locations rather than BigQuery multi-regional locations; this is the only way to ensure data locality.
The issue with BigQuery multi-regions is that they do not expose nor guarantee the placement of data within the scope of the multi-region. In fact, BigQuery may change the underlying storage region without notice. This could result in a drop in performance if the data happened to be co-located before the change.
Manage Quotas and Limits
There are per-minute, per-user, per-project quotas for API requests.
You can see your quota consumption for the BigQuery Storage Read API on the Quotas page under IAM and Admin in the Google Cloud console.
Copy to Google Cloud Storage
Copying data from BigQuery Storage to Google Cloud Storage can be done either with an extract job or with an EXPORT DATA statement. The semantics and reliability considerations for these two methods are the same.
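As an example, here is a minimal sketch of the EXPORT DATA route using the Python client library, with placeholder table and bucket names:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table and bucket; substitute your own. The URI must contain
# a wildcard so the export can be split across multiple objects.
export_sql = """
EXPORT DATA OPTIONS (
  uri = 'gs://my-archive-bucket/my_table/*.parquet',
  format = 'PARQUET',
  overwrite = true
) AS
SELECT * FROM `my-project.my_dataset.my_table`
"""

# EXPORT DATA runs as a query job; result() blocks until the job is DONE
# and raises an exception if the job failed.
client.query(export_sql).result()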
Data Consistency
Data is exported based on a snapshot isolation model. The result is atomic. If the job succeeds, then the Cloud Storage object exists and contains all rows. If the job fails, then the Cloud Storage object is not created.
Data Locality
When exporting data, the destination Google Cloud Storage bucket must be located in the same locational scope as the BigQuery dataset. For example, when exporting from an EU multi-regional dataset, the destination must either be the EU multi-region or a location within the EU.
For legacy reasons the US multi-region is allowed to export to any bucket, but for consistency of performance it is still recommended to stay within the continent.
Manage Quotas and Limits
Export jobs are subject to some quotas and limits. If you need more than 100,000 export jobs per-day or more than 50 TB per-day, then you should look to use the BigQuery Storage API instead.
Usage for export jobs is available in INFORMATION_SCHEMA. Specifically, the job entry in the JOBS_BY_* system tables for the export job contains a total_bytes_processed value that can be used to monitor the aggregate usage and ensure that it stays below 50 TB per day.
-- Aggregate bytes processed by extract jobs per day over the past week.
SELECT
  DATE(creation_time) AS day,
  project_id AS source_project_id,
  SUM(total_bytes_processed) AS total_bytes_processed
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 8 DAY) AND CURRENT_TIMESTAMP()
  AND job_type = "EXTRACT"
GROUP BY
  day,
  source_project_id
ORDER BY
  day DESC
Copy within BigQuery
Making a copy of a dataset or table within BigQuery can be done with a copy job.
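A minimal sketch using the Python client library, with placeholder table names:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder source and destination tables; substitute your own.
source = "my-project.my_dataset.my_table"
destination = "my-project.my_backup_dataset.my_table"

# Copy jobs are asynchronous; result() polls until the job finishes and
# raises an exception if the job ended in an error.
client.copy_table(source, destination).result()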
Data Locality
Copies within a location are done logically. While they usually complete quickly, BigQuery provides no latency guarantees on the asynchronous operation.
Copying datasets between locations requires an actual copy of all the underlying data. Therefore the run time of such copy jobs will be more variable. Also note that copying datasets between locations has not yet reached General Availability. The alternative is to export to a Google Cloud Storage bucket in the source location, copy the Cloud Storage data to the destination location, and then run an import job in the destination location.
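That alternative could look roughly like the following sketch, which assumes placeholder table and bucket names and uses the google-cloud-storage client for the intermediate object copy; for very large objects or datasets, the Storage Transfer Service is a more robust choice for step 2.

from google.cloud import bigquery, storage

bq = bigquery.Client()
gcs = storage.Client()

# Placeholder names; substitute your own tables and buckets.
SOURCE_TABLE = "my-project.eu_dataset.my_table"
DEST_TABLE = "my-project.us_dataset.my_table"
SOURCE_BUCKET = "my-eu-bucket"  # same location as the source dataset
DEST_BUCKET = "my-us-bucket"    # same location as the destination dataset
PREFIX = "my_table_export/"

# 1. Export to a Cloud Storage bucket in the source location.
bq.extract_table(
    SOURCE_TABLE,
    f"gs://{SOURCE_BUCKET}/{PREFIX}*.avro",
    job_config=bigquery.ExtractJobConfig(destination_format="AVRO"),
).result()

# 2. Copy the exported objects to a bucket in the destination location.
src_bucket = gcs.bucket(SOURCE_BUCKET)
dst_bucket = gcs.bucket(DEST_BUCKET)
for blob in gcs.list_blobs(SOURCE_BUCKET, prefix=PREFIX):
    src_bucket.copy_blob(blob, dst_bucket, blob.name)

# 3. Load the data into the destination dataset from the destination bucket.
bq.load_table_from_uri(
    f"gs://{DEST_BUCKET}/{PREFIX}*.avro",
    DEST_TABLE,
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO),
).result()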
Manage Quotas and Limits
Copy jobs are subject to some per-project and per-table quotas and limits. While the per-destination-table limits cannot be increased, the per-project per-day quota can be increased upon request.
Usage for copy jobs is available in INFORMATION_SCHEMA.
-- Count copy jobs per day, source project, and destination table over the past week.
SELECT
  DATE(creation_time) AS day,
  project_id AS source_project_id,
  CONCAT(destination_table.project_id, ":", destination_table.dataset_id, ".", destination_table.table_id) AS destination_table,
  COUNT(job_id) AS copy_job_count
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 8 DAY) AND CURRENT_TIMESTAMP()
  AND job_type = "COPY"
GROUP BY
  day,
  source_project_id,
  destination_table
ORDER BY
  day DESC
Use cases for reading data
This is a continuation of the use cases introduced previously in this series.
Real-time analytics
In this use case, streaming data is being inserted from endpoint logs continuously. A daily snapshot is to be exported to Google Cloud Storage for backup and archival purposes. The best choice is an extract job, subject to certain quotas and limitations.
Submit an extract job with the API or client libraries, passing in a unique ID as jobReference.jobId. Extract jobs are asynchronous. Check the job status using the unique job ID used to create the job. The job completed successfully if status.state is DONE. If status.errorResult is present, the job failed and needs to be retried.
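To illustrate, here is a minimal Python sketch with placeholder table, bucket, location, and job ID values:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names; substitute your own table, bucket, and location.
TABLE_ID = "my-project.logs_dataset.endpoint_logs"
DESTINATION_URI = "gs://my-archive-bucket/endpoint_logs/2024-05-01/*.avro"
LOCATION = "US"

# A unique job ID lets you check on this specific job later, and resubmitting
# with the same ID cannot accidentally start a second export.
JOB_ID = "daily-endpoint-logs-export-2024-05-01"

client.extract_table(
    TABLE_ID,
    DESTINATION_URI,
    job_id=JOB_ID,
    location=LOCATION,
    job_config=bigquery.ExtractJobConfig(destination_format="AVRO"),
)

# Extract jobs are asynchronous; poll the job by its ID to check the outcome.
job = client.get_job(JOB_ID, location=LOCATION)
if job.state == "DONE":
    if job.error_result:
        print("Extract failed and needs to be retried:", job.error_result)
    else:
        print("Extract succeeded")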
Data locality is a requirement: the Google Cloud Storage bucket needs to be located in the same region or multi-region as the BigQuery dataset that is being exported. It is good practice to store archival and backup data in a different region from the live data, so that you can recover from backup in case of a region loss. To facilitate this, data should be copied or moved to a Cloud Storage bucket in a different region after being exported from BigQuery.
Batch data processing
In this use case, a nightly batch job is used to load data by a fixed deadline. Once this load job completes, a table with statistics is materialized from a query as described in the previous section. Data from this table needs to be retrieved and compiled into a PDF report to be sent to a regulator.
Since the amount of data that needs to be read is small, the BigQuery API is a reasonable choice: use the tabledata.list API to retrieve all rows of the table in JSON dictionary format. If there is more than one page of data, the results will have the pageToken property set. To retrieve the next page of results, make another tabledata.list call and include the token value as the pageToken parameter. If the API call fails with a 5xx error, retry with exponential backoff. Most 4xx class errors are not retryable. For better decoupling of BigQuery export and report generation, results should be persisted to disk.
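As a sketch, assuming the google-cloud-bigquery Python client library and a placeholder name for the statistics table:

import json

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder statistics table; substitute your own.
TABLE_ID = "my-project.reporting.daily_statistics"

# list_rows wraps tabledata.list and follows page tokens automatically;
# transient 5xx errors are retried with exponential backoff by the client
# library's default retry policy.
rows = client.list_rows(TABLE_ID, page_size=1000)

# Persist the rows to disk so report generation is decoupled from the read.
with open("daily_statistics.jsonl", "w") as out:
    for row in rows:
        out.write(json.dumps(dict(row), default=str) + "\n")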
Next steps
- Introduction to reliability
- Reliability: Importing data
- Reliability: Querying data
- Reliability: Disaster recovery