Batch load and stream data with BigQuery Storage Write API
The BigQuery Storage Write API is a unified data-ingestion API for BigQuery. It combines streaming ingestion and batch loading into a single high-performance API. You can use the Storage Write API to stream records into BigQuery in real time or to batch process an arbitrarily large number of records and commit them in a single atomic operation.
Advantages of using the Storage Write API
Exactly-once delivery semantics. The Storage Write API supports exactly-once semantics through the use of stream offsets. Unlike the tabledata.insertAll method, the Storage Write API never writes two messages that have the same offset within a stream, provided the client supplies stream offsets when appending records.
Stream-level transactions. You can write data to a stream and commit the data as a single transaction. If the commit operation fails, you can safely retry the operation.
Transactions across streams. Multiple workers can create their own streams to process data independently. When all the workers have finished, you can commit all of the streams as a transaction.
Efficient protocol. The Storage Write API is more efficient than
the older insertAll
method because it uses gRPC streaming rather than REST
over HTTP. The Storage Write API also supports binary formats in the
form of protocol buffers, which are a more efficient wire format than JSON.
Write requests are asynchronous with guaranteed ordering.
Schema update detection. If the underlying table schema changes while the client is streaming, then the Storage Write API notifies the client. The client can decide whether to reconnect using the updated schema, or continue to write to the existing connection.
Lower cost. The Storage Write API has a significantly lower cost
than the older insertAll
streaming API. In addition, you can ingest up to
2 TB per month for free.
Required permissions
To use the Storage Write API, you must have bigquery.tables.updateData permissions.
The following predefined Identity and Access Management (IAM) roles include bigquery.tables.updateData permissions:
- bigquery.dataEditor
- bigquery.dataOwner
- bigquery.admin
For more information about IAM roles and permissions in BigQuery, see Predefined roles and permissions.
Overview of the Storage Write API
The core abstraction in the Storage Write API is a stream. A stream writes data to a BigQuery table. More than one stream can write concurrently to the same table.
Default stream
The Storage Write API provides a default stream, designed for streaming scenarios where you have continuously arriving data. It has the following characteristics:
- Data written to the default stream is available immediately for query.
- The default stream supports at-least-once semantics.
- You don't need to explicitly create the default stream.
If you are migrating from the legacy tabledata.insertAll API, consider using the default stream. It has similar write semantics, with greater data resiliency and fewer scaling restrictions.
For more information and example code, see Use the default stream for at-least-once semantics.
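As an illustration, the following is a minimal sketch of writing JSON rows to the default stream with the Java client library. The project, dataset, table, and customer_name column names are placeholders; adapt them to your own schema.

```java
import com.google.api.core.ApiFuture;
import com.google.cloud.bigquery.storage.v1.AppendRowsResponse;
import com.google.cloud.bigquery.storage.v1.BigQueryWriteClient;
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import com.google.cloud.bigquery.storage.v1.TableName;
import org.json.JSONArray;
import org.json.JSONObject;

public class DefaultStreamSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder project, dataset, and table names.
    TableName table = TableName.of("my-project", "my_dataset", "my_table");

    try (BigQueryWriteClient client = BigQueryWriteClient.create();
        // Building the writer from the table name (rather than an explicit stream
        // name) writes to the default stream; the writer fetches the table schema.
        JsonStreamWriter writer =
            JsonStreamWriter.newBuilder(table.toString(), client).build()) {

      // Batch several rows into a single append call rather than one row per call.
      JSONArray rows = new JSONArray();
      rows.put(new JSONObject().put("customer_name", "Alice")); // placeholder column
      rows.put(new JSONObject().put("customer_name", "Bob"));

      ApiFuture<AppendRowsResponse> future = writer.append(rows);
      future.get(); // Block for brevity; production code should handle futures asynchronously.
    }
  }
}
```

Because the writer targets the default stream, the appended rows are available for query as soon as the server acknowledges the request, with at-least-once semantics.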
Application-created streams
You can explicitly create a stream if you need either of the following behaviors:
- Stream-level commits for batch workloads.
- Exactly-once write semantics through the use of stream offsets.
When you create a stream, you specify a mode. The mode controls when data written to the stream becomes visible in BigQuery for reading.
Pending mode: Records are buffered in a pending state until you commit the stream. When you commit a stream, all of the pending data becomes available for reading. The commit is an atomic operation. Use this mode for batch workloads, as an alternative to BigQuery load jobs. For more information, see Batch load data using the Storage Write API.
Committed mode: Records are available for reading immediately as you write them to the stream. Use this mode for streaming workloads that need minimal read latency. The default stream always uses committed mode. You can create a committed-mode stream if you need exactly-once semantics. For more information, see Use committed mode for exactly-once semantics.
Buffered mode: Buffered mode is an advanced mode designed for the Apache Beam BigQuery I/O connector. This mode provides row-level commits. Records are buffered until the rows are committed by flushing the stream. Generally, applications should not use this mode. If you have small batches that you want to guarantee appear together, use committed mode and send each batch in one request.
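For example, here is a minimal sketch of exactly-once writes in committed mode with the Java client, using explicit stream offsets. The project, dataset, table, and customer_name column are placeholder names.

```java
import com.google.api.core.ApiFuture;
import com.google.cloud.bigquery.storage.v1.AppendRowsResponse;
import com.google.cloud.bigquery.storage.v1.BigQueryWriteClient;
import com.google.cloud.bigquery.storage.v1.CreateWriteStreamRequest;
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import com.google.cloud.bigquery.storage.v1.TableName;
import com.google.cloud.bigquery.storage.v1.WriteStream;
import org.json.JSONArray;
import org.json.JSONObject;

public class CommittedStreamSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder project, dataset, and table names.
    TableName table = TableName.of("my-project", "my_dataset", "my_table");

    try (BigQueryWriteClient client = BigQueryWriteClient.create()) {
      // Create an application-owned stream in committed mode.
      WriteStream stream =
          client.createWriteStream(
              CreateWriteStreamRequest.newBuilder()
                  .setParent(table.toString())
                  .setWriteStream(
                      WriteStream.newBuilder().setType(WriteStream.Type.COMMITTED).build())
                  .build());

      // The JsonStreamWriter converts JSON rows to protocol buffers using the
      // schema returned with the stream.
      try (JsonStreamWriter writer =
          JsonStreamWriter.newBuilder(stream.getName(), stream.getTableSchema()).build()) {
        long offset = 0;
        for (int batch = 0; batch < 3; batch++) {
          JSONArray rows = new JSONArray();
          rows.put(new JSONObject().put("customer_name", "row-" + batch)); // placeholder column
          // Providing the offset lets the backend reject appends it has already
          // applied, which is what gives exactly-once semantics on retries.
          ApiFuture<AppendRowsResponse> future = writer.append(rows, offset);
          future.get(); // Block for brevity.
          offset += rows.length();
        }
      }
    }
  }
}
```

Because each append carries its expected offset, a retried request that the server already applied is rejected instead of producing a duplicate row.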
API flow
This section describes the overall API flow. The sequence of calls depends on the stream mode:
- Default stream: AppendRows (loop)
- Pending mode: CreateWriteStream ⇒ AppendRows (loop) ⇒ FinalizeWriteStream ⇒ BatchCommitWriteStreams
- Committed mode: CreateWriteStream ⇒ AppendRows (loop) ⇒ FinalizeWriteStream (optional)
- Buffered mode: CreateWriteStream ⇒ AppendRows ⇒ FlushRows (loop) ⇒ FinalizeWriteStream (optional)
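To make the pending-mode flow concrete, here is a rough sketch of those calls with the Java client. The project, dataset, table, and customer_name column are placeholders, and error handling is reduced to a single check.

```java
import com.google.cloud.bigquery.storage.v1.BatchCommitWriteStreamsRequest;
import com.google.cloud.bigquery.storage.v1.BatchCommitWriteStreamsResponse;
import com.google.cloud.bigquery.storage.v1.BigQueryWriteClient;
import com.google.cloud.bigquery.storage.v1.CreateWriteStreamRequest;
import com.google.cloud.bigquery.storage.v1.FinalizeWriteStreamResponse;
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import com.google.cloud.bigquery.storage.v1.TableName;
import com.google.cloud.bigquery.storage.v1.WriteStream;
import org.json.JSONArray;
import org.json.JSONObject;

public class PendingStreamSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder project, dataset, and table names.
    TableName table = TableName.of("my-project", "my_dataset", "my_table");

    try (BigQueryWriteClient client = BigQueryWriteClient.create()) {
      // 1. CreateWriteStream: open a stream in pending mode.
      WriteStream stream =
          client.createWriteStream(
              CreateWriteStreamRequest.newBuilder()
                  .setParent(table.toString())
                  .setWriteStream(
                      WriteStream.newBuilder().setType(WriteStream.Type.PENDING).build())
                  .build());

      // 2. AppendRows (loop): rows are buffered and not yet visible to queries.
      try (JsonStreamWriter writer =
          JsonStreamWriter.newBuilder(stream.getName(), stream.getTableSchema()).build()) {
        JSONArray rows = new JSONArray();
        rows.put(new JSONObject().put("customer_name", "Alice")); // placeholder column
        writer.append(rows).get();
      }

      // 3. FinalizeWriteStream: no more rows can be appended to this stream.
      FinalizeWriteStreamResponse finalizeResponse =
          client.finalizeWriteStream(stream.getName());
      System.out.println("Finalized with " + finalizeResponse.getRowCount() + " rows");

      // 4. BatchCommitWriteStreams: atomically make the buffered rows visible.
      BatchCommitWriteStreamsResponse commitResponse =
          client.batchCommitWriteStreams(
              BatchCommitWriteStreamsRequest.newBuilder()
                  .setParent(table.toString())
                  .addWriteStreams(stream.getName())
                  .build());
      if (!commitResponse.hasCommitTime()) {
        // Inspect the stream errors to see why the commit did not succeed.
        throw new RuntimeException("Commit failed: " + commitResponse.getStreamErrorsList());
      }
    }
  }
}
```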
The AppendRows
method appends one or more records to the stream. The first
call to AppendRows
must contain a stream name along with the data schema,
specified as a DescriptorProto
. As a
best practice, send a batch of rows in each AppendRows
call. Do not send one
row at a time.
The protocol buffer message cannot use a package
specifier, and all nested or
enumeration types must be defined within the top-level root message. References
to external messages are not allowed. For an example, see
sample_data.proto. The Java and Go clients support
arbitrary protocol buffers, because the client library normalizes the protocol
buffer schema.
The FinalizeWriteStream
method finalizes the stream so that no new data can be
appended to it. This method is required in pending mode and optional in
committed and buffered modes. The default stream does not support this method.
If an error occurs, the returned google.rpc.Status
can include a
StorageError
in the
error details. Check the
StorageErrorCode
to find the specific error type. For more
information about the Google API error model, see Errors.
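One way to inspect those error details from the Java client is sketched below. It assumes the failure surfaced as a gRPC exception that carries a google.rpc.Status payload, for example from a failed AppendRows future.

```java
import com.google.cloud.bigquery.storage.v1.StorageError;
import com.google.protobuf.Any;
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.rpc.Status;
import io.grpc.protobuf.StatusProto;

public class StorageErrorSketch {
  // Extract a StorageError, if present, from an exception returned by an append call.
  static void logStorageError(Throwable t) throws InvalidProtocolBufferException {
    Status status = StatusProto.fromThrowable(t);
    if (status == null) {
      return; // Not a gRPC error carrying a google.rpc.Status payload.
    }
    for (Any detail : status.getDetailsList()) {
      if (detail.is(StorageError.class)) {
        StorageError error = detail.unpack(StorageError.class);
        // The StorageErrorCode identifies the specific failure, for example
        // STREAM_FINALIZED or OFFSET_ALREADY_EXISTS.
        System.err.println(error.getCode() + ": " + error.getErrorMessage());
      }
    }
  }
}
```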
Connections
The Storage Write API is a gRPC API that uses bidirectional
connections. The AppendRows
method creates a connection to a stream. You can
open multiple connections on the default stream. Application-created streams can
only have a single active connection. As a best practice,
limit the number of active connections.
Generally, a single connection supports at least 1 MB/s of throughput. The upper bound depends on several factors, such as network bandwidth, the schema of the data, and server load, but can exceed 10 MB/s. If you require more throughput, create more connections.
BigQuery closes the gRPC connection if the connection remains
idle for too long. If this happens, the response code is HTTP 409
. The gRPC
connection can also be closed in the event of a server restart or for other
reasons. If a connection error occurs, create a new connection. The Java and Go
client libraries automatically reconnect if the connection is closed.
Client library support
You can use the Storage Write API by calling the gRPC API directly or by using one of the client libraries, which are available for Java, Python, and Go. In general, we recommend using a client library, because it provides a simpler programming interface and manages the underlying bidirectional streaming RPC for you.
Java client
The Java client library provides two writer objects:
- StreamWriter: Accepts data in protocol buffer format.
- JsonStreamWriter: Accepts data in JSON format and converts it to protocol buffers before sending it over the wire. The JsonStreamWriter also supports automatic schema updates. If the table schema changes, the writer automatically reconnects with the new schema, allowing the client to send data using the new schema.
The programming model is similar for both writers. The main difference is how you format the payload.
The writer object manages a Storage Write API connection. The writer object automatically cleans up requests, adds the regional routing headers to requests, and reconnects after connection errors. If you use the gRPC API directly, you must handle these details.
Python client
The Python client is a lower-level client that wraps the gRPC API. To use this client, you must send the data as protocol buffers, as described in API flow.
To learn more about using protocol buffers with Python, read the Protocol buffer basics in Python tutorial.
Data type conversions
The following table shows the supported protocol buffer types for each BigQuery data type:
| BigQuery data type | Supported protocol buffer types |
|---|---|
| `BOOL` | `bool`, `int32`, `int64`, `uint32`, `uint64` |
| `BYTES` | `bytes` |
| `DATE` | `int32` (preferred), `int64`. The value is the number of days since the Unix epoch (1970-01-01). The valid range is `-719162` (0001-01-01) to `2932896` (9999-12-31). |
| `DATETIME`, `TIME` | `string` (the value must be a `DATETIME` or `TIME` literal) or `int64` (use the `CivilTimeEncoder` class to perform the conversion) |
| `FLOAT` | `double`, `float` |
| `GEOGRAPHY` | `string`. The value is a geometry in either WKT or GeoJSON format. |
| `INTEGER` | `int32`, `int64`, `uint32`, `enum` |
| `JSON` (Preview) | `string` |
| `NUMERIC`, `BIGNUMERIC` | `int32`, `int64`, `uint32`, `uint64`, `double`, `float`, `string`, or `bytes` (use the `BigDecimalByteStringEncoder` class to perform the conversion) |
| `STRING` | `string`, `enum` |
| `TIME` | `string`. The value must be a `TIME` literal. |
| `TIMESTAMP` | `int64` (preferred), `int32`, `uint32`. The value is given in microseconds since the Unix epoch (1970-01-01). |
| `REPEATED FIELD` | `array`. An array type in the proto corresponds to a repeated field in BigQuery. |
| `RECORD` | `message`. A nested message type in the proto corresponds to a record field in BigQuery. |
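To make the epoch-based encodings concrete, the following small sketch shows how a client could compute a DATE value as days since the Unix epoch and a TIMESTAMP value as microseconds since the Unix epoch using java.time. The specific date and time are arbitrary example values.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class EpochEncodingSketch {
  public static void main(String[] args) {
    // DATE: days since the Unix epoch (1970-01-01), sent as int32.
    int dateValue = (int) LocalDate.of(2024, 3, 15).toEpochDay();

    // TIMESTAMP: microseconds since the Unix epoch, sent as int64.
    long timestampValue =
        ChronoUnit.MICROS.between(Instant.EPOCH, Instant.parse("2024-03-15T12:30:00Z"));

    System.out.println(dateValue + " days, " + timestampValue + " microseconds");
  }
}
```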
Data manipulation language (DML) limitations
You cannot use the UPDATE
, DELETE
, or MERGE
statements to modify rows that
were written to a table by the BigQuery Storage Write API in the last 30 minutes.
You can use those statements to modify all other rows.
Storage Write API quotas
For information about Storage Write API quotas and limits, see BigQuery Storage Write API requests.
Storage Write API pricing
For pricing, see Data ingestion pricing.
What's next
- Stream data using the Storage Write API
- Batch load data using the Storage Write API
- Storage Write API best practices