The BigQuery Storage API provides fast access to BigQuery managed storage by using an RPC-based protocol.
Historically, users of BigQuery have had two mechanisms for accessing BigQuery-managed table data:
Record-based paginated access via the jobs.getQueryResults REST API methods. The BigQuery API provides structured row responses in a paginated fashion appropriate for small result sets.
Bulk data export using BigQuery extract jobs that export table data to Cloud Storage in a variety of file formats such as CSV, JSON, and Avro. Table exports are limited by daily quotas and by the batch nature of the export process.
The BigQuery Storage API provides a third option that represents an improvement over prior options. When you use the BigQuery Storage API, structured data is sent over the wire in a binary serialization format. This allows for additional parallelism among multiple consumers for a set of results.
The BigQuery Storage API does not provide functionality related to managing BigQuery resources such as datasets, jobs, or tables.
Multiple Streams: The BigQuery Storage API allows consumers to read disjoint sets of rows from a table using multiple streams within a session. This facilitates consumption from distributed processing frameworks or from independent consumer threads within a single client.
Dynamic Sharding: Assignment of data to streams is dynamic to help reduce tail latency and to remove the need for complex load balancing logic within clients. If a stream within a session never receives a
ReadRowsRequest, no data is assigned to the stream.
Column Projection: At session creation, users can select an optional subset of columns to read. This allows efficient reads when tables contain many columns.
Column Filtering: Users may provide simple filter predicates to enable filtration of data on the server side before transmission to a client.
Snapshot Consistency: Storage sessions read based on a snapshot isolation model. All consumers read based on a specific point in time. The default snapshot time is based on the session creation time, but consumers may read data from an earlier snapshot.
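The last three features can be pictured with a small in-memory model. The following is a minimal sketch, not the real API: ToyReadSession, StoredRow, and the integer commit times are hypothetical names invented for illustration, and the filter is a plain Python callable rather than a server-side predicate string.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterator, List, Optional


@dataclass(frozen=True)
class StoredRow:
    committed_at: int  # hypothetical commit time, in arbitrary ticks
    values: Dict[str, Any]


class ToyReadSession:
    """Hypothetical in-memory model of a read session (not the real API)."""

    def __init__(
        self,
        table: List[StoredRow],
        snapshot_time: int,
        selected_fields: Optional[List[str]] = None,
        row_filter: Optional[Callable[[Dict[str, Any]], bool]] = None,
    ) -> None:
        self._table = table
        self._snapshot_time = snapshot_time
        self._selected_fields = selected_fields
        self._row_filter = row_filter

    def read(self) -> Iterator[Dict[str, Any]]:
        for row in self._table:
            # Snapshot consistency: ignore anything committed after the snapshot.
            if row.committed_at > self._snapshot_time:
                continue
            # Column filtering: drop rows before they would be transmitted.
            if self._row_filter is not None and not self._row_filter(row.values):
                continue
            # Column projection: return only the requested columns.
            if self._selected_fields is None:
                yield dict(row.values)
            else:
                yield {name: row.values[name] for name in self._selected_fields}


table = [
    StoredRow(1, {"name": "ada", "age": 36, "city": "London"}),
    StoredRow(2, {"name": "bob", "age": 25, "city": "Paris"}),
]
session = ToyReadSession(
    table,
    snapshot_time=3,
    selected_fields=["name", "city"],
    row_filter=lambda values: values["age"] > 30,
)
# This row is committed after the snapshot, so the session never sees it.
table.append(StoredRow(5, {"name": "cy", "age": 41, "city": "Oslo"}))
rows = list(session.read())
```

In this model, every consumer that reads from the same session sees the same set of rows regardless of later writes, which is the property the snapshot isolation model is meant to provide.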
Enabling the API
The BigQuery Storage API is distinct from the existing BigQuery API. The BigQuery Storage API must be enabled independently. See Enabling and disabling APIs for more information on enabling the BigQuery Storage API.
The name of the API is the "BigQuery Storage API" in the Google Cloud Console. The
endpoint for the BigQuery Storage API is
Establishing a read session to a BigQuery table requires permissions to two distinct resources within BigQuery: the project that controls the session and the table from which the data is read.
More detailed information about granular BigQuery permissions can be found on the Predefined roles and permissions page.
Basic API flow
This section describes the basic flow of using the BigQuery Storage API. For examples, see the libraries and samples page.
Create a session
BigQuery Storage API usage begins with the creation of a read session.
Options for requesting a specific number of streams, limiting the columns, and
filtering data on the server side are all specified as part of the request to
create the session. The ReadSession response provides a list of available streams,
and it provides the reference schema information for data sent to all streams.
Sessions expire automatically after a day and do not require any cleanup.
You can also choose the data allocation behavior when creating sessions via the
use of a specific ShardingStrategy. For example, one strategy
works well when you want more data allocated to stream consumers that process
data more rapidly than peers. However, some frameworks prefer a more balanced
approach for allocating data to streams. For these scenarios, a balanced
strategy is recommended.
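The trade-off between the two allocation behaviors can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the service's actual algorithm: the function names, the notion of fixed-size blocks, and the per-block processing costs are all hypothetical.

```python
import heapq
from typing import Dict


def allocate_dynamic(num_blocks: int, cost_per_block: Dict[str, int]) -> Dict[str, int]:
    """Hand each block to whichever consumer becomes free first."""
    # Heap entries are (time at which the consumer is free, consumer name).
    free_at = [(0, name) for name in sorted(cost_per_block)]
    heapq.heapify(free_at)
    counts = {name: 0 for name in cost_per_block}
    for _ in range(num_blocks):
        now, name = heapq.heappop(free_at)
        counts[name] += 1
        heapq.heappush(free_at, (now + cost_per_block[name], name))
    return counts


def allocate_balanced(num_blocks: int, cost_per_block: Dict[str, int]) -> Dict[str, int]:
    """Split blocks evenly across consumers, ignoring their speed."""
    names = sorted(cost_per_block)
    counts = {name: 0 for name in names}
    for index in range(num_blocks):
        counts[names[index % len(names)]] += 1
    return counts


def finish_time(counts: Dict[str, int], cost_per_block: Dict[str, int]) -> int:
    """Time at which the slowest consumer finishes its share."""
    return max(counts[name] * cost_per_block[name] for name in counts)


# Hypothetical workload: one consumer is three times slower than the other.
costs = {"fast": 1, "slow": 3}
dynamic = allocate_dynamic(12, costs)
balanced = allocate_balanced(12, costs)
print(dynamic, finish_time(dynamic, costs))
print(balanced, finish_time(balanced, costs))
```

In this model the dynamic allocation finishes sooner because the faster consumer absorbs more blocks, while the balanced allocation gives every consumer a predictable share, which some frameworks prefer.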
Read from a session stream
Data from a given stream is retrieved by invoking the
ReadRows streaming RPC.
Once the read request for a
Stream is initiated, the backend will begin
transmitting blocks of serialized row data. If there is an error, you can
restart reading a stream at a particular point by supplying the row offset when
you call ReadRows.
Due to dynamic sharding, data is only allocated to
Streams that are used
to request row data.
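Restarting a stream from a row offset can be sketched as follows. This is a minimal illustration, assuming a client that counts the rows it has already received: FlakyStream, TransientError, and read_all are hypothetical names, and the real ReadRows RPC returns blocks of serialized rows rather than individual Python values.

```python
from typing import Any, Iterator, List


class TransientError(Exception):
    """Hypothetical stand-in for a retryable transport error."""


class FlakyStream:
    """Hypothetical stream that fails once, just before row `fail_at`."""

    def __init__(self, rows: List[Any], fail_at: int) -> None:
        self._rows = rows
        self._fail_at = fail_at
        self._has_failed = False

    def read_rows(self, offset: int = 0) -> Iterator[Any]:
        for position in range(offset, len(self._rows)):
            if not self._has_failed and position == self._fail_at:
                self._has_failed = True
                raise TransientError(f"connection lost before row {position}")
            yield self._rows[position]


def read_all(stream: FlakyStream, max_attempts: int = 3) -> List[Any]:
    """Read a whole stream, resuming at the first row not yet received."""
    received: List[Any] = []
    for _ in range(max_attempts):
        try:
            # The offset equals the number of rows already consumed, so a
            # retry neither skips nor repeats rows.
            for row in stream.read_rows(offset=len(received)):
                received.append(row)
            return received
        except TransientError:
            continue
    raise RuntimeError("stream could not be read within the attempt limit")


result = read_all(FlakyStream(list(range(10)), fail_at=4))
```

Tracking the offset on the client side is what makes the retry safe: the second request starts at row 4, so the final result contains each row exactly once.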
Additional methods exist in the API for indicating early finalization of a particular stream and for requesting more stream identifiers once the session is established. See the API reference for more information.
Decode row blocks
Row blocks must be deserialized once they are received. Currently, users of the BigQuery Storage API may specify that all data in a session be serialized using either the Apache Avro format or the Apache Arrow format.
The reference schema is sent as part of the initial ReadSession response,
appropriate for the data format selected. In most cases, decoders can be
long-lived because the schema and serialization are consistent among all streams
and row blocks in a session.
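To illustrate why a decoder can be long-lived, the following sketch builds one decoder from a schema and reuses it for several row blocks. It is a deliberately tiny subset of the Avro binary encoding (long, string, and unions with null), not a replacement for an Avro library; RowDecoder, the example schema, and the idea of passing a row count alongside each block are assumptions made for illustration.

```python
import io
from typing import Any, Dict, List


def read_long(buf: io.BytesIO) -> int:
    """Decode one Avro long: a base-128 varint holding a zigzag value."""
    shift = 0
    raw = 0
    while True:
        byte = buf.read(1)[0]
        raw |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)


class RowDecoder:
    """Built once from the session schema, then reused for every block."""

    def __init__(self, schema: Dict[str, Any]) -> None:
        self._fields = [(field["name"], field["type"]) for field in schema["fields"]]

    def _read_value(self, buf: io.BytesIO, avro_type: Any) -> Any:
        if isinstance(avro_type, list):
            # Union: a branch index comes first, then the value of that branch.
            return self._read_value(buf, avro_type[read_long(buf)])
        if avro_type == "null":
            return None
        if avro_type == "long":
            return read_long(buf)
        if avro_type == "string":
            length = read_long(buf)
            return buf.read(length).decode("utf-8")
        raise ValueError(f"type not supported by this sketch: {avro_type!r}")

    def decode_block(self, data: bytes, row_count: int) -> List[Dict[str, Any]]:
        buf = io.BytesIO(data)
        return [
            {name: self._read_value(buf, avro_type) for name, avro_type in self._fields}
            for _ in range(row_count)
        ]


# Hypothetical schema with one required column and one nullable column.
schema = {
    "type": "record",
    "name": "row",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": ["null", "string"]},
    ],
}
decoder = RowDecoder(schema)
first_block = decoder.decode_block(b"\x02\x02\x04ab\x05\x00", row_count=2)
second_block = decoder.decode_block(b"\xd8\x04\x00", row_count=1)
```

Because the schema does not change within a session, the same decoder instance handles both blocks without being rebuilt.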
Avro Schema Details
Due to type system differences between BigQuery and the Avro specification, Avro schemas may include additional annotations that identify how to map the Avro types to BigQuery representations. When compatible, Avro base types and logical types are used. The Avro schema may also include additional annotations for types present in BigQuery that do not have a well-defined Avro representation.
To represent nullable columns, unions with the Avro
NULL type are used.
|BigQuery standard SQL type||Avro type||Avro schema annotations|
||bytes||logicalType: decimal (with precision and scale)|
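As an illustration of the bytes row above: in the Avro decimal logical type, the bytes hold the unscaled value as a big-endian two's-complement integer, and the scale from the schema annotation places the decimal point. The following is a minimal sketch; decode_avro_decimal is a hypothetical helper name, and the scale values used in the example are arbitrary.

```python
from decimal import Decimal


def decode_avro_decimal(raw: bytes, scale: int) -> Decimal:
    """Turn Avro decimal bytes plus a schema-provided scale into a Decimal."""
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(ch) for ch in str(abs(unscaled)))
    # Building from a (sign, digits, exponent) tuple is exact, so wide values
    # are not rounded to the default arithmetic context precision.
    return Decimal((sign, digits, -scale))


price = decode_avro_decimal(b"\x30\x39", scale=2)   # unscaled 12345
tiny = decode_avro_decimal(b"\xff", scale=9)        # unscaled -1
wide = decode_avro_decimal((10**38 - 1).to_bytes(16, "big", signed=True), scale=9)
```

Constructing the Decimal from its digits rather than dividing by a power of ten matters for wide values, because division would round to the precision of the active decimal context.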
Arrow Schema Details
The Apache Arrow format lends itself well to Python data science workloads. For cases where multiple BigQuery types converge on a single Arrow datatype, the metadata property of the Arrow schema field will indicate the original datatype.
|BigQuery standard SQL type||Arrow logical type||Notes|
||Date||32-bit days since epoch|
||Timestamp||Microsecond precision, no timezone|
||Timestamp||Microsecond precision, UTC timezone|
||Decimal||Precision 38, scale 9|
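The date and timestamp rows in the table describe plain integer encodings. The following sketch shows what those raw integers mean using only the Python standard library; the helper names are hypothetical, and an Arrow library would normally perform these conversions for you.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_NAIVE = datetime(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_from_days(days: int) -> date:
    """Date column: a 32-bit count of days since the epoch."""
    return EPOCH_DATE + timedelta(days=days)


def naive_from_micros(micros: int) -> datetime:
    """Timestamp without timezone: microseconds since the epoch, no tzinfo."""
    return EPOCH_NAIVE + timedelta(microseconds=micros)


def utc_from_micros(micros: int) -> datetime:
    """Timestamp with UTC timezone: microseconds since the epoch, in UTC."""
    return EPOCH_UTC + timedelta(microseconds=micros)


new_year = date_from_days(18262)
wall_clock = naive_from_micros(1_577_836_800_000_001)
instant = utc_from_micros(1_577_836_800_000_001)
```

The two timestamp variants share the same integer representation. Only the timezone metadata differs, which is why the Arrow schema field is needed to tell them apart.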
Because the BigQuery Storage API operates on managed storage directly, you cannot use the BigQuery Storage API to read data sources such as federated tables and logical views.
In some cases, reading small anonymous (cached) tables is disallowed via the BigQuery Storage API.
There are restrictions on the ability to reorder projected columns and the complexity of row filter predicates. Currently, filtering support when serializing data using Apache Avro is more mature than when using Apache Arrow.
The BigQuery Storage API is available in all BigQuery regional and multi-regional locations. For more information, see Dataset locations.
Quotas and limits
For BigQuery Storage API quotas and limits, see BigQuery Storage API limits.
For information on BigQuery Storage API pricing, see the Pricing page.