Use the BigQuery Storage Read API to read table data
The BigQuery Storage Read API provides fast access to BigQuery-managed storage by using an rpc-based protocol.
Historically, users of BigQuery have had two mechanisms for accessing BigQuery-managed table data:
Record-based paginated access by using the
jobs.getQueryResultsREST API methods. The BigQuery API provides structured row responses in a paginated fashion appropriate for small result sets.
Bulk data export using BigQuery
extractjobs that export table data to Cloud Storage in a variety of file formats such as CSV, JSON, and Avro. Table exports are limited by daily quotas and by the batch nature of the export process.
The BigQuery Storage Read API provides a third option that represents an improvement over prior options. When you use the Storage Read API, structured data is sent over the wire in a binary serialization format. This allows for additional parallelism among multiple consumers for a set of results.
The Storage Read API does not provide functionality related to managing BigQuery resources such as datasets, jobs, or tables.
Multiple Streams: The Storage Read API allows consumers to read disjoint sets of rows from a table using multiple streams within a session. This facilitates consumption from distributed processing frameworks or from independent consumer threads within a single client.
Column Projection: At session creation, users can select an optional subset of columns to read. This allows efficient reads when tables contain many columns.
Column Filtering: Users may provide simple filter predicates to enable filtration of data on the server side before transmission to a client.
Snapshot Consistency: Storage sessions read based on a snapshot isolation model. All consumers read based on a specific point in time. The default snapshot time is based on the session creation time, but consumers may read data from an earlier snapshot.
Enabling the API
The Storage Read API is distinct from the BigQuery API, and shows up separately in the Google Cloud console as the BigQuery Storage API. However, the Storage Read API is enabled in all projects in which the BigQuery API is enabled; no additional activation steps are required.
Establishing a read session to a BigQuery table requires permissions to two distinct resources within BigQuery: the project that controls the session and the table from which the data is read. Note that permission to the table can be granted through any of the following:
More detailed information about granular BigQuery permissions can be found on the Predefined roles and permissions page.
Basic API flow
This section describes the basic flow of using the Storage Read API. For examples, see the libraries and samples page.
Create a session
Storage Read API usage begins with the creation of a read session. The
maximum number of streams, the snapshot time, the set of columns to return, and
the predicate filter are all specified as part of the
supplied to the
ReadSession response contains a set of
Stream identifiers. When a read
session is created, the server determines the amount of data that can be read in
the context of the session and creates one or more streams, each of which
represents approximately the same amount of table data to be scanned. This means
that, to read all the data from a table, callers must read from all
identifiers returned in the
ReadSession response. This is a change from
earlier versions of the API, in which no limit existed on the amount of data
that could be read in a single stream context.
ReadSession response contains a reference schema for the session and a
list of available
Stream identifiers. Sessions expire automatically and do not
require any cleanup or finalization. The expiration time is returned as part of
ReadSession response and is guaranteed to be at least 6 hours from session
Read from a session stream
Data from a given stream is retrieved by invoking the
ReadRows streaming RPC.
Once the read request for a
Stream is initiated, the backend will begin
transmitting blocks of serialized row data. If there is an error, you can
restart reading a stream at a particular point by supplying the row offset when
To support dynamic work rebalancing, the Storage Read API provides an
additional method to split a
Stream into two child
Stream instances whose
contents are, together, equal to the contents of the parent
Stream. For more
information, see the API reference.
Decode row blocks
Row blocks must be deserialized once they are received. Currently, users of the Storage Read API may specify all data in a session to be serialized using either Apache Avro format, or Apache Arrow.
The reference schema is sent as part of the initial
appropriate for the data format selected. In most cases, decoders can be
long-lived because the schema and serialization are consistent among all streams
and row blocks in a session.
Avro Schema Details
Due to type system differences between BigQuery and the Avro specification, Avro schemas may include additional annotations that identify how to map the Avro types to BigQuery representations. When compatible, Avro base types and logical types are used. The Avro schema may also include additional annotations for types present in BigQuery that do not have a well defined Avro representation.
To represent nullable columns, unions with the Avro
NULL type are used.
|Google Standard SQL type||Avro type||Avro schema annotations|
||bytes||logicalType: decimal (precision = 38, scale = 9)|
||bytes||logicalType: decimal (precision = P, scale = S)|
||bytes||logicalType: decimal (precision = 77, scale = 38)|
||bytes||logicalType: decimal (precision = P, scale = S)|
Arrow Schema Details
The Apache Arrow format works well with Python data science workloads.
For cases where multiple BigQuery types converge on a single Arrow data type, the metadata property of the Arrow schema field indicates the original data type.
If you're working in an older version of the Storage Read API, then use the appropriate version of Arrow as follows:
- v1beta1: Arrow 0.14 and earlier
- v1: Arrow 0.15 and later
Regardless of API version, to access API functions, we recommend that you use the BigQuery Storage API client libraries. The libraries can be used with any version of Arrow and don't obstruct its updates.
|Google Standard SQL type||Arrow logical type||Notes|
||Date||32-bit days since epoch|
||Timestamp||Microsecond precision, no timezone|
||Timestamp||Microsecond precision, UTC timezone|
||Decimal||Precision = 38, scale = 9|
||Decimal||Precision = P, scale = S|
||Decimal256||Precision = 76, scale = 38|
||Decimal256||Precision = P, scale = S|
Because the Storage Read API operates on storage, you cannot use the Storage Read API to directly read logical views, or materialized views.
There are restrictions on the ability to reorder projected columns and the complexity of row filter predicates. Currently, filtering support when serializing data using Apache Avro is more mature than when using Apache Arrow.
The Storage Read API is supported in the same regions as BigQuery. See the Dataset locations page for a complete list of supported regions and multi-regions.
Quotas and limits
For Storage Read API quotas and limits, see Storage Read API limits.
For information on Storage Read API pricing, see the Pricing page.