Batch load data using the Storage Write API
This document describes how to use the BigQuery Storage Write API to batch load data into BigQuery.
In batch-load scenarios, an application writes data and commits it as a single atomic transaction. When using the Storage Write API to batch load data, create one or more streams in pending type. Pending type supports stream-level transactions. Records are buffered in a pending state until you commit the stream.
For batch workloads, also consider using the Storage Write API through the Apache Spark SQL connector for BigQuery with Dataproc, rather than writing custom Storage Write API code.
The Storage Write API is well-suited to a data pipeline architecture. A main process creates a number of streams. For each stream, it assigns a worker thread or a separate process to write a portion of the batch data. Each worker creates a connection to its stream, writes data, and finalizes its stream when it's done. After all of the workers signal successful completion to the main process, the main process commits the data. If a worker fails, its assigned portion of the data will not show up in the final results, and the whole worker can be safely retried. In a more sophisticated pipeline, workers checkpoint their progress by reporting the last offset written to the main process. This approach can result in a robust pipeline that is resilient to failures.
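To make that division of labor concrete, the following is a minimal sketch of the fan-out pattern in Python, assuming the google-cloud-bigquery-storage client. Here write_partition is a hypothetical callback that appends one partition's rows to its stream and finalizes it, and parent is the table path in the form projects/PROJECT/datasets/DATASET/tables/TABLE; neither name comes from this page.

```python
# Hedged sketch of the fan-out pattern: one pending stream per worker, commit
# only after every worker reports success. write_partition() is a hypothetical
# helper that appends a partition's rows to its stream and finalizes it.
from concurrent.futures import ThreadPoolExecutor

from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types


def load_batch(parent, partitions, write_partition):
    client = bigquery_storage_v1.BigQueryWriteClient()

    # Main process: create one pending stream per partition of the batch.
    streams = []
    for _ in partitions:
        stream = types.WriteStream(type_=types.WriteStream.Type.PENDING)
        streams.append(
            client.create_write_stream(parent=parent, write_stream=stream)
        )

    # Workers: each writes its partition to its own stream, then finalizes it.
    with ThreadPoolExecutor(max_workers=len(streams)) as pool:
        futures = [
            pool.submit(write_partition, client, stream.name, partition)
            for stream, partition in zip(streams, partitions)
        ]
        for future in futures:
            future.result()  # re-raises if a worker failed; nothing is committed

    # Main process: commit all streams as a single atomic operation.
    commit_request = types.BatchCommitWriteStreamsRequest(
        parent=parent, write_streams=[s.name for s in streams]
    )
    client.batch_commit_write_streams(commit_request)
```

A failed worker can simply be rerun against a fresh pending stream, because none of its rows become visible until the main process commits.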
Batch load data using pending type
To use pending type, the application does the following:
- Call CreateWriteStream to create one or more streams in pending type.
- For each stream, call AppendRows in a loop to write batches of records.
- For each stream, call FinalizeWriteStream. After you call this method, you cannot write any more rows to the stream. If you call AppendRows after calling FinalizeWriteStream, it returns a StorageError with StorageErrorCode.STREAM_FINALIZED in the google.rpc.Status error. For more information about the google.rpc.Status error model, see Errors.
- Call BatchCommitWriteStreams to commit the streams. After you call this method, the data becomes available for reading. If there is an error committing any of the streams, the error is returned in the stream_errors field of the BatchCommitWriteStreamsResponse.
Committing is an atomic operation, and you can commit multiple streams at once. A stream can only be committed once, so if the commit operation fails, it is safe to retry it. Until you commit a stream, the data is pending and not visible to reads.
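As an illustration (not taken from the published samples), a caller using the Python client might check the stream_errors field and retry the whole commit along these lines; client, parent, and stream_names are placeholders for a BigQueryWriteClient, the table path, and the finalized stream names:

```python
# Hedged sketch: commit a set of finalized pending streams and retry on error.
from google.cloud.bigquery_storage_v1 import types


def commit_streams(client, parent, stream_names, attempts=3):
    request = types.BatchCommitWriteStreamsRequest(
        parent=parent, write_streams=stream_names
    )
    for attempt in range(attempts):
        response = client.batch_commit_write_streams(request)
        if not response.stream_errors:
            return response  # all streams committed atomically
        # The commit failed for at least one stream; because commit is atomic
        # and a stream can only be committed once, retrying the request is safe.
        print(f"Commit attempt {attempt + 1} failed: {list(response.stream_errors)}")
    raise RuntimeError("Batch commit did not succeed after retries")
```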
After the stream is finalized and before it is committed, the data can remain in the buffer for up to 4 hours. Pending streams must be committed within 24 hours. There is a quota limit on the total size of the pending stream buffer.
The following code shows how to write data in pending type:
C#
To learn how to install and use the client library for BigQuery, see BigQuery client libraries. For more information, see the BigQuery C# API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
Go
To learn how to install and use the client library for BigQuery, see BigQuery client libraries. For more information, see the BigQuery Go API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
Java
To learn how to install and use the client library for BigQuery, see BigQuery client libraries. For more information, see the BigQuery Java API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
Node.js
To learn how to install and use the client library for BigQuery, see BigQuery client libraries. For more information, see the BigQuery Node.js API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
Python
This example shows a simple record with two fields. For a longer example that shows how to send different data types, including STRUCT types, see the append_rows_proto2 sample on GitHub.
To learn how to install and use the client library for BigQuery, see BigQuery client libraries. For more information, see the BigQuery Python API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for client libraries.
This code example depends on a compiled protocol module, customer_record_pb2.py. To create the compiled module, execute protoc --python_out=. customer_record.proto, where protoc is the protocol buffer compiler. The customer_record.proto file defines the format of the messages used in the Python example.
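The published sample is not reproduced here, but the pending-type flow in Python looks roughly like the following sketch. It assumes the google-cloud-bigquery-storage client and a compiled customer_record_pb2 module whose CustomerRecord message has row_num and customer_name fields; those field names are illustrative assumptions, not taken from this page.

```python
# Minimal sketch of the pending-type flow (not the full published sample).
# Assumes customer_record_pb2.py was generated with protoc as described above,
# and that CustomerRecord has row_num and customer_name fields (assumed names).
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types, writer
from google.protobuf import descriptor_pb2

import customer_record_pb2


def append_rows_pending(project_id, dataset_id, table_id):
    write_client = bigquery_storage_v1.BigQueryWriteClient()
    parent = write_client.table_path(project_id, dataset_id, table_id)

    # Create a write stream in pending type.
    write_stream = types.WriteStream(type_=types.WriteStream.Type.PENDING)
    write_stream = write_client.create_write_stream(
        parent=parent, write_stream=write_stream
    )

    # The first request must carry the protocol buffer schema of the rows.
    proto_descriptor = descriptor_pb2.DescriptorProto()
    customer_record_pb2.CustomerRecord.DESCRIPTOR.CopyToProto(proto_descriptor)
    proto_data = types.AppendRowsRequest.ProtoData(
        writer_schema=types.ProtoSchema(proto_descriptor=proto_descriptor)
    )
    request_template = types.AppendRowsRequest(
        write_stream=write_stream.name, proto_rows=proto_data
    )
    append_rows_stream = writer.AppendRowsStream(write_client, request_template)

    # Append a batch of serialized rows. A real pipeline calls this in a loop.
    proto_rows = types.ProtoRows()
    row = customer_record_pb2.CustomerRecord(row_num=1, customer_name="Alice")
    proto_rows.serialized_rows.append(row.SerializeToString())
    request = types.AppendRowsRequest(
        proto_rows=types.AppendRowsRequest.ProtoData(rows=proto_rows)
    )
    request.offset = 0
    append_rows_stream.send(request).result()  # wait for the append to be acknowledged
    append_rows_stream.close()

    # Finalize the stream; no further AppendRows calls are allowed on it.
    write_client.finalize_write_stream(name=write_stream.name)

    # Commit the stream; the data becomes visible to reads.
    commit_request = types.BatchCommitWriteStreamsRequest(
        parent=parent, write_streams=[write_stream.name]
    )
    write_client.batch_commit_write_streams(commit_request)
```

The other client libraries follow the same create, append, finalize, and commit sequence described earlier in this document.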